Webinar "Gen AI data chain at scale"

Ask your questions using Slido: https://app.sli.do/event/624vAAknoWuvMLZa7F5SX6

===========================

Join the Data Phoenix Slack community: https://join.slack.com/t/data-phoenix/shared_invite/zt-115lu0xo1-KhDX_4xAyEd4JiuiUZ3ieQ
Subscribe to Data Phoenix Digest: https://dataphoenix.info/

===========================

Generative AI workflows heavily rely on data-centric tasks—such as filtering samples by annotation fields, vector distances, or scores produced by custom classifiers. At the same time, computer vision datasets are quickly approaching petabyte volumes, rendering data wrangling difficult. In addition, the iterative nature of data preparation necessitates robust dataset sharing and versioning mechanisms, both of which are hard to implement ad-hoc. In this workshop we will introduce DVCx – an upcoming product by Iterative that separates the storage and processing of samples from metadata and enables data-centric operations at scale for machine learning teams and individual researchers.

Speaker:
Tibor Mach is a Machine Learning Solutions Engineer at Iterative.ai. He has been working in ML and MLOps for the past 5 years. Tibor has a Ph.D. in mathematics from the University of Göttingen and published papers in the field of probability theory before refocusing on ML.

Hello everyone, and welcome to the Data Phoenix webinar. Today we will talk about data management and how you can use a new product from, I can say, the pioneers of data versioning, the guys from Iterative.

Today they will share their new product and show how you can use DVCx to scale your datasets for generative AI. With us today is Tibor Mach. He is a Machine Learning Solutions Engineer, and he will present this topic. If you have questions, please use Slido; the link is in the chat and under the video, and we will answer your questions at the end of his presentation. For now, I give the microphone to Tibor.

Okay, thank you for the introduction, Dmytro. You should all be seeing my screen, hopefully; if not, let me know in the chat. Yeah, good. Before I really start, let me introduce myself. As Dmytro mentioned, I’m a Solutions Engineer at Iterative; I’ll introduce the company in a second. Before this, I was working in consulting. I spent four years doing that, starting as a data scientist and then slowly drifting more and more towards MLOps and managing the entire machine learning life cycle rather than working on individual models. But starting as a data scientist helped me appreciate all of the aspects of ML. Before that, my background is actually not in computer science; it’s in mathematics, in probability theory. I was working as a postdoc for a bit, and then I moved into machine learning.

So MLOps was really interesting to me, and that’s why I joined Iterative. I actually started learning and using DVC and other tools from Iterative before joining, because I really like the idea behind DVC. I like this new product, DVCx, as well, because it answers a demand which is rising right now, and it also completes the MLOps picture. For those who don’t know DVC, I will give a bit of an introduction to that part of our products, and then I’ll move on to DVCx. It’s not just for gen AI and LLMs; it’s generally for managing large unstructured data. But of course, right now all the hype is around LLMs and gen AI, so this is also what we will structure this around a bit. As you can see in this very simplistic picture, there’s a link between DVCx and DVC, which is also why I want to mention DVC. But DVCx can make sense on its own as well for certain use cases; it really depends on what you use it for. So let me just jump in. Actually, let me just open the full slideshow... or not, I don’t know. It seems to be okay now.

It’s working, great. So let me give you an overview of all the products that we have at Iterative. As Dmytro mentioned, we have DVCx. That’s an upcoming product, I would say, rather than a new product. The release should be quite soon, probably in December, so you get a little bit of a preview today. Some things are still being worked on, but it’s very close to a released version. We have DVC, which is our first and, I would say, main product, up till now at least. And then we have a SaaS offering, software as a service, to wrap it all together.

If you think about these three things, DVCx is mostly for large datasets, unstructured data specifically, and the creation of large datasets. The X, to me, stands not for the new logo of Twitter but for extra large. So you have large unstructured data, and when I say large, I mean millions of images, even billions of images. That’s the kind of scale you need when you want to build gen AI models. Or maybe you don’t want to build gen AI models, but you still want to curate large datasets like that. I will give an example. It’s not an example of an actual customer, but it’s very similar to an actual customer that we’re piloting DVCx with, and it might give you some ideas of the use cases that might be interesting. That is the scenario where you have large datasets but then want to train relatively small models, perhaps on a curated dataset. You start with a lot of images, and you need to filter them because you want to fine-tune your models further, maybe with something like reinforcement learning, or you want to really fine-tune a specific aspect of your model. You have a lot of data, but you need to find the right data for your models to get even that little bit better.

Now, or maybe a year ago or so, all the buzz was around data-centric AI, how curating data is really the most important thing. This is also something that DVCx can help you with, and then you can transition to something like DVC to go on. You can think of DVCx as something like Spark for unstructured data. It’s a simplification, but I’ll show you how that actually materializes. And you can think of DVC as something like Terraform for machine learning. The main point of DVC is to make it possible to use GitOps practices in your machine learning, to have Git as the single source of truth for all that you do, from the raw data for your models, through the experimentation, all the way to a model registry, assigning models to stages, to production, etc. The same way that Terraform does this for infrastructure, DVC does this for machine learning.

DVC is mostly free, open-source software. If you’re a single developer, or if you have one or two projects, you are probably okay with just that. If you have a larger team, or multiple projects, or both, which is often the case, then it is also useful to use our software-as-a-service offering, which will include DVCx and which is called Studio, DVC Studio. It has features for collaboration: live tracking of experiments and sharing those experiments with your colleagues. It has all the bells and whistles of the experiment trackers that you’re used to, but it also stays focused on this GitOps approach. It still keeps Git as the single source of truth for everything, but it gives you an environment that you can collaborate in easily, even with less technical users. It also has more and more features which help you manage credentials, so if you want to access models, you can do it through Studio; you don’t have to share, for example, access keys to S3 buckets or something like that. I won’t talk too much about Studio. I will show it, though, because DVCx runs in Studio, so we will see an example of how DVCx works.

Now I will go very briefly through what DVC does, because it’s illustrative of what it can do and of where it has some shortcomings, which are addressed by DVCx for this particular scenario where you want to use those large datasets and then train some models on them. Just very briefly: all of this, as I mentioned, runs on your infrastructure. That’s pretty much true of all of those tools.

You can use it on any of the major cloud providers, and you can use it on-prem as well. Since this is very tightly coupled to Git, especially with DVCx, sorry, with DVC, you can use it with any Git forge: GitLab, GitHub, Bitbucket. And while DVCx is mainly aimed at unstructured data, DVC is agnostic in this sense, so you can use it for structured data, unstructured data, anything really. If you work with big data which is structured, you would probably use something like Spark. If you have big data which is unstructured, which has become more and more common recently, then this is where DVCx comes in.

All right, so let me move on and start with a brief intro to DVC. I’ll give this maybe 10 minutes, and then I’ll jump into DVCx with the context that we get from this. So, DVC. It’s still obviously the main product, but it’s the first thing that we created, maybe a flagship of sorts, until now at least. The way it works, as I mentioned, is like Terraform for data, but you can also think of it as being Git-like. It’s not replacing Git in any way, shape, or form. It’s actually working together with Git, but it replicates the Git way of working for your data.

For example, let’s say that you have an ML project and you create a model. It could be one of these transformer models, which are quite large, so 500 megabytes is very possible for a model like that, and it could be even more. That’s not something that you want to version with Git. So what do you do? If you use something like SageMaker, maybe that’s handled for you, but then you’re fixed to that platform. Or you could just say, all right, I’ll save my model to S3, a network drive, or anything, and I’ll save the path to that model in my Git repository in a metadata file. There you’re getting a little bit closer to how DVC works, but DVC does some things on top of this. It does the versioning for you, so there is no danger of people making mistakes. DVC also has its own cache, the same way Git does, and it versions that and pushes it to a remote, which could be that S3 bucket, but it’s managed by DVC. So you don’t have to worry about rewriting that path to S3. Or, if you version a dataset, you don’t have to worry about somebody deleting 100 images, so that when you’re trying to reproduce your model with supposedly the same data, it gives different results and you don’t know why. This is all handled by DVC.

Since it works in a Git-like fashion, it works with hashes, basically the same way that Git does. The effect of that is, first of all, that everything is secure: you really know that it’s that one file. And it doesn’t duplicate anything. Once a file is versioned on the remote, even if you use it in 500 different commits, it’s still going to refer to that single file. This again makes it better in terms of reproducibility. The connection to Git is through metadata files that DVC creates automatically. These are actually versioned by Git and stored in your Git commits, unlike the actual files, and they are the link that DVC uses to pull the data from that remote to your local machine.

This principle is then propagated to machine learning pipelines. I won’t have time to really talk about DVC pipelines today, but let me just say there’s quite a simple way to create pipelines with DVC, where you structure the stages of your project. It works in a way that captures all the inputs and outputs of all the stages and versions them automatically, the same way that I just described for the data or the model that you would version manually. Here is an example: if you run a DVC pipeline, you end up with this lock file. As you can see here, I have the hashes of all the inputs of my data split stage. The same goes for the outputs of that stage, and the same goes for all the stages. So what I end up with is a fully reproducible ML pipeline, which is versioned in my Git repo without having to version the data itself with Git, which is usually not a good idea for anything that’s more than a few megabytes. It also has some other cool features. For example, you can skip stages, because thanks to all the versioning you know that in some situations the outcome of a stage is going to be exactly the same. I won’t have time to go into that; if there are questions, I can briefly touch upon it.

The third part of this is that the same GitOps logic is applied to the model registry. The model registry, which we have in the Studio platform that I mentioned, allows you to manage the life cycle of your models once you register some experiments as versioned models. All of the actions in that model registry are captured as Git tags, so in the end they still live in your Git repository, and you get a fully auditable history from there. Because of this, you’re also able to connect the model registry actions very easily to your CI/CD, because every action corresponds to a tag. Then it’s a very easy step to, for example, write GitHub Actions workflows to deploy your models, build a Docker image, or do anything, really, that you want.

So that’s the philosophy behind DVC: you can apply GitOps at every step of the ML life cycle. But there are some limitations of DVC, and one of them is that DVC needs to work with your data on the machine where you are working in order to version it. It stores the data in a cache, and the cache is pushed to the remote. It works okay if you have thousands, tens of thousands, even hundreds of thousands of files. Let’s stay with images; it doesn’t have to be images, but we’re going to be mostly talking about images today. So let’s say you have a few tens of thousands of images. That’s okay: you can version them with DVC, you have direct access to them, and it has a lot of advantages. But once you get to the scale of millions or even billions of images, this is no longer practical. You can’t possibly have a machine where you can download all of that data, and even if you could, it would be expensive and slow. You need to work with data like that in a parallelized fashion. It’s the same logic that applies to Spark. You don’t want to use pandas if you have, let’s say, banking transactions from your bank for the past 10 years. It’s going to be terabytes of data that you just can’t process locally. You want to process that in a parallelized fashion and maybe turn it into something smaller that you then use for your machine learning. And you can do the same thing with DVCx.

There’s another aspect of this. I think a lot of you have probably seen an image like this. DVC, in many ways, is mostly helping you with the ML modeling: improving your algorithms, making sure they’re reproducible, and all of that.

That’s important stuff, obviously, if you’re trying to train a model. But if you have any practice with data science, you know that, if we just go by the time you spend working on something, most of the time goes to the boring but important stuff of cleaning the data. And if you work with something like CV models, a lot of that is also labeling; that holds in other contexts too, but especially in CV. So this is also where DVCx is supposed to help, with those 80% of the time that you would otherwise spend on this anyway. There are tools to help you label data, you know, there’s Label Studio, and I probably don’t even know all of those tools, and the same goes for data analysis and things like that. But the difference comes in now with these large datasets that are being used in gen AI and LLMs, basically transformer models, because that’s what most of this ends up being. These are large models. They need a lot of data, and they can actually make use of a lot of data to get better. But that means that, for example, labeling becomes a challenge. At this point, okay, you could hire a lot of people and let them label your dataset, but if you have hundreds of millions or billions of images, it’s going to be very costly to do this, and it’s going to be very costly to do it in a consistent way.

There might be better ways to do this. One way could be to label a little bit manually, then create models that can do the labeling for you, and then use those in your data chain as an automated labeler whose output is consumed by another model down the line, or in fact it could be the product itself. But you really want to label the entire dataset if you have it; otherwise you’re throwing away part of the data, and that’s your gold. That’s why Facebook is mostly free, because they make so much money on that, and you want to make use of that too. But for this you need scaling. If you have structured data, you would probably use something like Spark in a lot of cases. For unstructured data, Spark is not all that great, and this is where DVCx comes in. I’ll show you some examples in a moment. And this cleaning, that’s another part of it. Basically, you want to query very, very large datasets, you want to add some signals to that dataset, and you want to bundle it up.

What do I mean by bundling it up? Here is an anonymous website from the internet that you probably won’t recognize. Yeah, okay, so it’s Amazon. This is just a screenshot from a random page on Amazon. There’s a vest, a woman’s vest, and there’s quite a lot of data here that, if you think about it, Amazon is definitely going to work with. Amazon is not one of our customers, but we do have a customer we are piloting DVCx with which is roughly the same order of size as Amazon, and it’s also an online retailer. I’m not going to name them, and I’m not going to mention the exact use case.

It’s going to be very similar to what I’m going to be describing here. So let’s say that we have this e-shop, or even an aggregator of many e-shops, and we have these images of items. The first problem we might have is that we have images and we have structured data, like the fabric type, which could be a categorical variable, or the origin, which could maybe also be categorical. But we also have a lot of unstructured data. We have ratings with reviews, and those are completely free-form. We have these product descriptions, which are probably provided by the producer of the item or by the distributor. We also have other images, for example images of the same product from different angles. How do we find which are which? And then we would also like to show our customers something related that they might like if they like this product. So how do we go about that?

Obviously there are many ways, and this has been done for a long time, with varying degrees of success. But now, with those gen AI tools and those large models, you can do quite a bit more than in the past. Actually, in my previous job we had a customer similar to Amazon which was aggregating data from various stores, and one issue they had was making sure that they don’t list the same item multiple times. If you can buy an item from several shops, they would want to aggregate that and show all the shops you can buy it from. But that’s actually quite difficult. You can’t necessarily go by ID, because the IDs can be different, and a lot of them are going to be wrong. There are many ways you can put it together. You can go by the item description and see if those match. You can do all of these things, but the best thing you can do is to do all of that at once, and also work with the image data and some kind of similarity search between the products. If you try to do all of that, it’s going to lead to good results, but once you want to scale it up to the level of Amazon, or even something slightly more modest than Amazon, on a regional level, you’re going to hit a lot of blockers if you don’t have parallelization and a way to run all of these queries concurrently at scale.

One other thing that you’re going to run into is that you really need to manage all of those different datasets, and they need to be put together. For example, for this item I would like to have a way to say: this is that image, it has all these reviews attached to it, it has this description, and it has these images which are similar. I want to keep it all tightly coupled, because I want to use it as a single dataset that I iterate upon. So that’s one thing.

What do I want to do with it once I put all this information together? I might want to find similar stuff, so I can recommend all these other products. Or I can go this way and just show sponsored stuff; that’s another option. But even Amazon is not just showing random things. Even though some things might be promoted, they’re clearly similar to the image of this item. So you want to do that. You might also want to remove something. For example, you might by mistake have some images of naked people that you don’t want to show in your e-shop, or you might have some reviews which are spam, or something that you really don’t want to show. And maybe it’s not just spam. Spam is easy to filter, but maybe it’s something that’s just unrelated to the product that you don’t want to show. So you still need the context of all of this other stuff related to the item when you decide whether to show it or not.

And once you do all of that, and you want to do it at scale, you want to version these datasets, at least if you want to use them somewhere for ML down the line, or for any kind of auditability or reproducibility. Let’s say that I use all of this information from all of the images and descriptions of the clothes that I have, and I train some models down the line. It could be a model like FashionCLIP, which I’m going to show in a moment, which is used for finding similarity by combining unstructured textual and image information. But then a new collection comes in for the winter season, and I retrain the model on that new data, or including that new data. I want to be able to go back and see that, okay, this is the data that was actually used for this version of the model, so that I can compare them with each other. You remember that the whole deal with DVC was to version everything, to have completely reproducible pipelines. With DVCx you don’t want to lose that, so you need something like that as well, with DVCx or with anything that you use for dataset creation. All right, so that’s the motivation; that’s what we want to get to. So let me just show you.
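
The versioning requirement described here can be illustrated with a minimal sketch. It is not how DVCx is implemented; the `DatasetRegistry` class, the `fingerprint` helper, and the example URIs are all hypothetical. It only shows the idea that each registered dataset version is an immutable snapshot of references, so a model can be tied back to the exact data it saw and two versions can be compared.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(rows):
    """Stable content hash of a list of reference rows (order-independent)."""
    canonical = json.dumps(sorted(rows, key=lambda r: r["uri"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class DatasetRegistry:
    """Hypothetical in-memory registry: dataset name -> list of versions."""
    versions: dict = field(default_factory=dict)

    def register(self, name, rows):
        """Store a snapshot and return its 1-based version number."""
        history = self.versions.setdefault(name, [])
        digest = fingerprint(rows)
        # Re-registering identical content does not create a new version.
        if history and history[-1]["digest"] == digest:
            return len(history)
        history.append({"digest": digest, "rows": [dict(r) for r in rows]})
        return len(history)

    def get(self, name, version):
        return self.versions[name][version - 1]["rows"]

    def diff(self, name, old, new):
        """List the references added and removed between two versions."""
        old_uris = {r["uri"] for r in self.get(name, old)}
        new_uris = {r["uri"] for r in self.get(name, new)}
        return {
            "added": sorted(new_uris - old_uris),
            "removed": sorted(old_uris - new_uris),
        }


registry = DatasetRegistry()
summer = [{"uri": "s3://shop/img/vest.jpg", "label": "vest"}]
winter = summer + [{"uri": "s3://shop/img/parka.jpg", "label": "parka"}]

v1 = registry.register("clothes", summer)   # data behind the first model
v2 = registry.register("clothes", winter)   # data after the winter collection
changes = registry.diff("clothes", v1, v2)
```

Recording the version number next to each trained model is what would make it possible to go back later and see which snapshot a given model was trained on.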

This is a very high-level picture, and then I’ll jump into the UI of DVCx in a moment. At a high level, you have some data. You can store it on your local storage, or you can have it in S3, Azure, GCP, anything like that; as I mentioned, we’re agnostic here. What DVCx adds is a metadata layer, which helps you work with that data without moving it anywhere. It creates a metadata layer on top of the data, as a sort of data catalog, but a little bit more than that, because it allows you to add more information, to combine all of these different sources of information, and to add new signals, which is probably best explained if I show it in a second.
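
As a rough illustration of the metadata-layer idea, here is a minimal sketch in plain Python. It assumes a storage listing is already available as `(uri, size)` pairs; the `Row` type and the `index_listing` and `add_signal` helpers are hypothetical names for this example, not part of DVCx. The point is that the table only holds references and computed signals, while the objects themselves stay where they are.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    uri: str    # reference to the object; the bytes stay in storage
    size: int   # basic metadata taken from the storage listing
    signals: dict = field(default_factory=dict)  # extra columns added later


def index_listing(listing):
    """Turn a storage listing of (uri, size) pairs into metadata rows."""
    return [Row(uri=uri, size=size) for uri, size in listing]


def add_signal(rows, name, fn):
    """Attach a new column, computed per row, to the metadata table."""
    for row in rows:
        row.signals[name] = fn(row)
    return rows


# Made-up listing for illustration only.
listing = [
    ("s3://bucket/a.jpg", 120_000),
    ("s3://bucket/b.jpg", 4_000),
    ("s3://bucket/notes.txt", 300),
]

rows = index_listing(listing)
rows = add_signal(rows, "is_image", lambda r: r.uri.endswith(".jpg"))
rows = add_signal(rows, "is_thumbnail", lambda r: r.size < 10_000)

# Select references by signal without touching the underlying files.
selected = [
    r.uri for r in rows
    if r.signals["is_image"] and not r.signals["is_thumbnail"]
]
```

In a real setting the signal functions could be anything from a parsed annotation file to the score of a classifier; here they are trivial so the example stays self-contained.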

It also gives you a dataset manager, a UI where you can do all of these things, and you can also use it to automate all these jobs once you’re happy with the results. Then you publish the datasets; you export them for consumption by any algorithms down the line. It could be that you have some kind of collaborative filter, for example, to show these related products here, or you go and train some models on top of that data downstream.

So let me switch now to the actual UI. We don’t have that much time, unfortunately, but let me very briefly show you where the UI comes from. This is our Studio platform. So far it’s been used for DVC projects, where you import the repository with your models as a project. When you open it, you can look at experiments, compare their metrics, all of that stuff that you see elsewhere. The main differentiator here is that everything is actually versioned in your Git repository in the end. Then you have a model registry where you can see all of the models, you can assign stages to them, and you can connect it to CI/CD, as I mentioned. You can go to individual models, but that’s really not what we’re going to look at today.
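
Because registry actions are captured as Git tags, a CI/CD job can decide what to do just by looking at the tag that triggered it. The sketch below shows that idea only. The two tag formats in it are assumptions chosen for illustration and may not match what Studio actually writes, and `parse_tag` and `should_deploy` are hypothetical helpers.

```python
import re

# Assumed tag conventions, for illustration only:
#   "<model>@v<major>.<minor>.<patch>"  registers a model version
#   "<model>#<stage>#<n>"               assigns a version to a stage
VERSION_TAG = re.compile(r"^(?P<model>[\w-]+)@v(?P<version>\d+\.\d+\.\d+)$")
STAGE_TAG = re.compile(r"^(?P<model>[\w-]+)#(?P<stage>[\w-]+)#(?P<n>\d+)$")


def parse_tag(tag):
    """Map a Git tag name to a registry event, or None if it is unrelated."""
    m = VERSION_TAG.match(tag)
    if m:
        return {"action": "register", "model": m["model"], "version": m["version"]}
    m = STAGE_TAG.match(tag)
    if m:
        return {"action": "assign", "model": m["model"], "stage": m["stage"]}
    return None


def should_deploy(tag, stage="prod"):
    """True only when the tag assigns some model to the given stage."""
    event = parse_tag(tag)
    return bool(event) and event["action"] == "assign" and event["stage"] == stage
```

A pipeline triggered on tag pushes could call `should_deploy` with the pushed tag name and skip the deployment step when it returns `False`.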

About two weeks ago I gave a talk at the MLOps World conference. I was talking just about DVC for 90 minutes and I still couldn’t cover everything, so here we are being very, very brief. But there’s the third bit, and that’s DVCx.

Here we have a demo team. In Studio you’re organized in teams, so you can be a part of several teams, for example. Here I have the DVCx platform under the Datasets tab, and if I open it, you will probably immediately recognize it as reminiscent of some of these SQL managers. That’s not a coincidence; it’s supposed to mirror that approach. It’s meant for data preparation and exploration, but afterwards you create maybe cron jobs or automation that versions and curates those datasets for you. You start by doing that manually, exploring what you actually want to do and how you want to curate.

And you start with data sources. Here we have an S3 bucket, and as we saw in that image just a moment ago, it can also be Azure, it can be local storage, anything. But here we have an S3 account, and on that account we have multiple buckets, each of which contains a dataset. We have some example datasets here which are used for demoing. For example, we have the LAION dataset, which I’m going to use now. The first thing that you want to do is to index the dataset. What that does is retrieve the information from the storage, for example from S3 on AWS here. It gives you some basic metadata, and it creates a table, or rather makes the data available for DVCx to work with. So I can just create an index.

Here I could also pair JSON files, for example between images and masks. I’ll actually do that in a different way that I’ll show you in a second, so let me just create an index. So now it’s indexing, which means that in a while, I think it’s going to be about 20 seconds or so, we’re going to have the bucket indexed, and then we can start working with it in this kind of SQL-like or Spark-like fashion. What we use here is a library called DQL. DQL sounds like SQL, so it sounds like a language, but it’s not a language, actually. It’s a Python library, so you don’t have to learn a new language; you can just work with Python. But otherwise it has a lot of methods and functions which are similar to how you would do things with SQL. For example, you can run some basic operations like filtering your datasets by name, size, etc. But you can add a lot more, actually, and this is why I’m here in the JSON pairs.
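
To give a feel for what a SQL-like query written in Python can look like, here is a small sketch in plain Python. It is not the DQL API: the `Query` class, its method names, and the sample index are all hypothetical and only mirror the kind of operations mentioned here, such as filtering by name and size.

```python
from fnmatch import fnmatch


class Query:
    """Hypothetical chainable query over indexed rows; not the DQL API."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        # Keep only the rows for which the predicate is true (like WHERE).
        return Query(r for r in self.rows if predicate(r))

    def order_by(self, key, descending=False):
        # Sort rows by one column (like ORDER BY).
        return Query(sorted(self.rows, key=lambda r: r[key], reverse=descending))

    def limit(self, n):
        # Keep at most n rows (like LIMIT).
        return Query(self.rows[:n])

    def results(self):
        return self.rows


# Made-up index rows; a real index would come from the storage listing.
index = [
    {"name": "img/001.jpg", "size": 48_211},
    {"name": "img/001.json", "size": 512},
    {"name": "img/002.jpg", "size": 910},
    {"name": "img/003.jpg", "size": 73_004},
]

large_jpegs = (
    Query(index)
    .filter(lambda r: fnmatch(r["name"], "*.jpg"))
    .filter(lambda r: r["size"] > 1_000)
    .order_by("size", descending=True)
    .limit(10)
    .results()
)
```

Each method returns a new `Query`, so steps can be chained and reordered much like clauses in a SQL statement while everything stays ordinary Python.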

It is okay so this indexing taking a bit longer than usual it’s maybe because I’m want to uh I’m actually now restricted to a single machine normally we would work with a cluster of machines uh right now I’m on a on a single machine so things are bit

Slower uh but once this is indexed maybe I just need to refresh it uh once this is indexed uh you it’s available for you to work with and you can run a query like this so you know if we look at this basically I just mentioned the bucket I

Do some filtering and here’s the interesting part actually where I add signal uh I have some Json which are uh related to uh my images and I want to pair them so I want to create essentially a metad DAT layer on top of my data set of images where I

Have all the information not just about the images themselves but also uh the stuff that’s related to all those Json that are attached to this so This is actually if I run this query and I did that today so you already see the results of the preview of the results

Here then you get something like this uh you have this metadata layer on top of the images so we have like 50,000 images here here we see the first 20 we can look at some of those images yeah so this is a lion data set so it contains

Actually quite a lot of different images it’s not all that specific some of those are clotes um and uh you know it gives you all the information that you would get from from indexing the data set it also gives now information that we got from these Json

Pairs uh uh so from the Json files that were attached to those images uh and we can actually do more than this we can write our own udfs and we can um add new signals using machine learning models this is what I was hinting at with the AO labeling at the

beginning. But before I show you that, let me refresh and see if the indexing has finished. This is the bane of live demos of things which are not released yet, but fortunately we don’t have to worry about this right

now, because I ran this already and the results are here; here’s the preview. If I now wanted to persist this data set with those JSON pairs, I would just click here on “Register dataset” and give it a name, I don’t know, something with LAION,

and it starts versioning automatically, so it’s offering the first version. I could rename it to something else, and you can add some tags and descriptions; I’m not going to do that now. Then I just click on “Register dataset” and it’s going to create that data

set for me. As you can see here on the left, it showed up under the data sets, and we now have a first version of this data set. We could then export it. This is actually not shown here,

but it’s in the development version right now, where I can export a data set and then use it with, for example, DVC or something else. Oh, now it looks like the indexing finally finished. So I could use it downstream

for training ML models. So what I achieved here is this: I took the large data set, picked just the JPEG files from it, added all the information from the JSON files, and packed it in a single table that can then be

used by anything downstream to reference all of those images. I packed it as a data set, and the data set is really nothing else than a table with references that I can use elsewhere, but the references include all these extra signals, all this extra information that

I got from the JSON. Now, JSON files are nice, but I want more. I mentioned FashionCLIP, right? This goes back to that image that we had, where we took the actual image and some textual description and we want to combine them

together. I don’t know if you know the CLIP model; basically it’s a way to work with unstructured data, with text and images at the same time, and produce some interesting results. In this case we’re just using a FashionCLIP score, and

we can use it to find similarity between images. For example, when I want those vests which look very similar, I can use FashionCLIP. But again, in this case I want to run it at scale; I want to run it on my

entire data set, and this is possible here because I can run it in a cluster, like I mentioned. Right now I’m on a single machine, but in DVCx you can set up a cluster of 8, 10, or 20 machines, and these can even be GPU

machines, making it even quicker. And I can add signals to that, so I can take that data set that I just created. In fact this is a different data set, but that doesn’t matter too much: this is a data set from Zalando, which is a European, I think German, clothes

shop. And I can add this FashionCLIP score to it, so that now I have a data set which not only has all of that information but also this FashionCLIP score. I can then use it

downstream for some machine learning, or for collaborative filtering, for example. All right, so then again I would run this query. I’m not going to do that, because it’s not going to create anything other than what’s already here. But then I can again register this data

set as something new. In fact, I think this was registered. Oh, it wasn’t, so I could create a new data set. Or, actually, yeah, I think it was registered; I think it was the LAION JSON one. So I can add a new version

of that data set. I already have 11 versions here, apparently, so I can create a 12th version; why not. And again, all of this is versioned. I can always look up the query that created

the data set, and I can of course look at the results or export them. So how does this work? Why do I have FashionCLIP? This is because DVCx enables you to use UDFs. You can import them

in the settings, where you can just specify the requirements, but in this case I just copied it here so we can have a look at the UDF. There are two UDFs here, and since this is all Python, you just

have the normal FashionCLIP class that you would have with that model, and the only thing that you really attach to it is this decorator, which tells DQL, and therefore also DVCx, as DQL is

just the library behind it, that this is supposed to be used as a UDF, that it can be used in queries, and how to use it in queries like the one that we just made. For example, here I’m saying: give me the FashionCLIP score

from that model, do it in batches of 100, and initialize the model on each of the machines that it’s going to run on. You can use pretty much any kind of UDF that you can think of and make it a part of your workflow here,

make it run in a parallelized fashion on all those machines, so that you can really do this at scale. Or, instead of UDFs, you can use what’s already included: in the release we will include some basic similarity search and stuff

like that. So there will be some functionality included that’s even more optimized and that you don’t have to write, but if something is missing you can always write a UDF and use it in your query to add to those signals. So now we could add those

JSON pairs, we could add the scores from FashionCLIP or any other model, and then we would maybe version the data set and ship it. Maybe we want to do a bit more; maybe we want to make it smaller and reduce it to some

specific images. So you can just query as you would in something like Spark or SQL. You have, for example, a filter method, which allows you to zoom into something specific. For example, in this data set I just take

things which are large enough and which have a high “punsafe” value. This punsafe is a measure of whether the image is maybe, as they say, not safe for work, so I just deliberately picked those that have a high score. We look at one of

them: yeah, so it’s yoga, so apparently the model has some suspicion about this being something that is not safe for kids. And this one is even funnier, it’s just “Naked Truth”, so that’s bad. So, if it’s actually

something that you want to filter out, you can get rid of it and keep only what you want to train on or use downstream. One other thing that you can do, if you have a specific image that maybe you know your models don’t

work on: you can add similarity search, as I mentioned. For example, here again we can use FashionCLIP to get the things which are most similar to one image, and again run all of this at scale, for your entire data set. You just pick the most

similar images; here we’re limiting it to 20, just for demonstration purposes. So, what is this image? I can show you the image that I used here. Yeah, so it’s like a screenshot. It’s

not really a picture of someone in clothes; it’s a screenshot from a website. And it’s going to show me the most similar images. Yeah, this one is also a screenshot. So maybe I want to get rid of these and

I just want to keep people in clothes, not the entire e-shop. So we can do things like that. You could do all of this without DVCx, but what DVCx adds here is the

scalability, and the way to combine, at the scale of billions of images, all of that information into this metadata data set that you can then reuse, export, and use downstream. The last thing I want to mention before we finish, because I’m

running out of time, sadly, is Delta updating. As I mentioned before, you can have this situation where a new collection comes in, so you want to update the data set, but you already have, I don’t know, 50 million images, and now you have

a million more, and you want to add all of these signals from FashionCLIP, and you want to pair the JSON again. And it can be a lot more than just FashionCLIP; you can have 10 or 20 models. That’s

in fact the case for that customer that I don’t want to talk about directly: they really have a lot of models that each give a specific description of the images so they can sort them. Their big problem is that they don’t want to

do it on the entire data set all over again, because even if you can do it relatively fast with concurrency, you don’t want to waste the money and the compute resources to do that every time, if you have, say, 50 new images added to a few billion.

So you can actually use Delta updating. Right now it’s, let’s say, a bit manual: you query the old data set, you subtract the old items from the new ones, and then you run all these methods, like adding those signals, only

on the new ones, and then you put them all together. That basically saves the resources. In the release it’s most likely going to be automatic: you will have Delta updating, and you will also have automatic reindexing as you add more stuff to the buckets, so you

can then create and iterate on new versions without having to remember to do all of this manually, or to have people create those data sets. So that’s basically it. DVCx works as something like what you could do

with something like BigQuery or some other methods, but it adds these extra features where you can create all these UDFs, create more context and more information, put together all of this metadata for your images, and create this finalized data set that you can then work with down the

line. It will integrate well with DVC once it’s released, so DVC will be able to import the results of DVCx and work with them down the line. And it handles a lot of the stuff that you would want to do about

the creation, so we can handle that automatically for you. All right, so I’ll stop here and I’ll have a look at the questions. Well, we have quite a lot. Yeah, thank you for your presentation and the preview. The interface is really very similar to BigQuery.

Yeah. So, if you have some questions, please ask them using Slido, and we will read these questions. You can find the link under the video, or you can scan this QR code and ask your question. First, I have

also a question. As I understand it, it’s not possible to install it in my own environment for free; it’s not an open source product? That’s a great question. So, DQL is, or will be. I mentioned that we have DVCx and then we have DQL; DQL is the library behind

DVCx, and DQL is going to be open source, so you can run it in your environment, you can run it locally. But the nature of the product is the same as with Spark: you can run Spark on your machine,

but usually you only want to do that for debugging or trying things out, because you want to run it on a cluster to really take advantage of all of the features. If you have something that fits on your machine and you don’t need to parallelize, then you

probably don’t need DVCx. So that’s the idea: you can use DQL, you can test things, you can even develop, and then you just export it and use it with DVCx to really scale everything up. But yeah, in order to use the

concurrency and all of these automation tools, the cron jobs and things like that that I showed, and the UI itself, you would have to use the paid version. Yeah, understood. One question we have is about your switch from mathematics: why did you

switch? Yeah, that’s a great question. That would be for a longer talk, I guess. And you’re right, Göttingen is a great place to study mathematics. I’m not sure I actually mentioned that I was there, but

I think maybe it was in my bio. But yeah, in mathematics you really have to be either in the top few percent in the world, or you have to go wherever you find a job, where they offer you a long-term

opportunity, because the academic market is a bit scarce in this regard. So that’s one of the reasons I decided that maybe I don’t want to go to, well, nothing against New Zealand, it’s just an example of a place that’s far away, I

don’t want to go to New Zealand just because they offered me a position there and I don’t want to lose that. So that was one of the reasons. Yeah, in academia sometimes you need to go to some new country or new institute because only in that place do you have an opportunity.

Yeah. So I think there was another question in the chat: what’s the name of the product you are promoting? So maybe that was a bit confusing. We have DVC, which is one product; that’s free open source software, that’s the MLOps thing, you

know, kind of like the GitOps tool. There are other libraries related to it, but we kind of group it all and call it DVC. Then there’s DVC Studio, which puts together both DVC and DVCx; that’s

a paid product that you can use for collaboration and, especially with DVCx, for all of this concurrency and all of these features that let you scale everything up. And DVCx is the new product that’s not yet released and

that will be released probably in December and will be included in DVC Studio, and that’s the main thing that I was talking about. Yeah. How do you generate embeddings of images, and how long does it take? Yeah, so how long really depends on how many images you

have. Yeah, but, I see, for hundreds of millions of images. OK, so it still depends on the cluster that you create. Like I said, I was not really running those queries, because in my account I’m now limited

to a single-machine cluster, so things are relatively slow. That’s really not the use case for DVCx. Instead, depending on how many images you have, you could scale it up to 10 or 20 machines, usually not that much more than that.

It also depends, obviously, on how big those images are and how complex the algorithm for your embeddings is, so I can’t really give you a precise number. But you can think of it this way: if I have one machine, it’s going to take me, let’s say,

when we’re talking about 100 million, realistically it’s going to take me two weeks, potentially, and that’s the same if I just run it locally. But if instead I scale up to 20 machines, I’m going to have it done

in less than a day, potentially. If I have GPU machines, it might be an hour. Obviously that’s going to cost more, but if I want to iterate quickly, I don’t want to wait two weeks between each iteration of my ML models that maybe depend on that data

set. So in this regard you can think of it the same as with Spark: if you run on a single machine it’s going to be slower, especially with a large data set. If it’s a small data set, then maybe not.
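
As a rough illustration of the numbers above (about two weeks on one machine, less than a day on 20), here is a minimal Python sketch. It assumes ideal linear scaling across machines, which is an assumption rather than a measured property of DVCx; the function name and the efficiency parameter are hypothetical and are not part of DVCx or DQL.

```python
def estimated_days(single_machine_days: float, machines: int,
                   efficiency: float = 1.0) -> float:
    """Estimate wall-clock days when work is split evenly across machines.

    efficiency is a hypothetical fudge factor in (0, 1]; 1.0 means
    perfect linear scaling, lower values model coordination overhead.
    """
    if machines < 1:
        raise ValueError("machines must be at least 1")
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return single_machine_days / (machines * efficiency)


# Two weeks on a single machine, compared with 10 and 20 machines.
for machines in (1, 10, 20):
    print(machines, round(estimated_days(14, machines), 2))
```

Under these assumptions, 14 days on one machine becomes 0.7 days on 20 machines, which matches the "less than a day" estimate; the GPU figure of about an hour depends on the model and hardware and is not captured by this sketch.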

And if you have a large data set, you really want to scale it up. Yeah. So, thank you for this presentation, thank you for being with us and sharing your knowledge and your product, and for doing a preview of your

product. It’s very cool that you share what you will release. Thank you to all who were with us today and who asked questions. If you have more questions, let’s continue the discussion in our Slack channel, and see you at our next webinar.

All right, thank you very much, and thank you for having me. All right then, bye.
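
The manual Delta update described earlier in the talk (query the old data set, subtract the old items from the new listing, compute signals only for the new items, then put everything together) can be sketched in plain Python. This is a minimal sketch under assumptions: the data set is modelled as a dictionary from file path to signals, and `delta_update` and `fake_score` are hypothetical names for illustration. It does not use or represent the DQL or DVCx API.

```python
from typing import Callable, Dict, Iterable

Signals = Dict[str, float]


def delta_update(
    old_rows: Dict[str, Signals],
    current_paths: Iterable[str],
    compute_signals: Callable[[str], Signals],
) -> Dict[str, Signals]:
    """Return the old rows plus rows for paths that were not seen before."""
    # Subtract the old items from the new listing.
    new_paths = set(current_paths).difference(old_rows)
    # Run the expensive signal computation only on the new items.
    new_rows = {path: compute_signals(path) for path in sorted(new_paths)}
    # Put old and new rows together.
    return {**old_rows, **new_rows}


# Hypothetical stand-in for an expensive model such as a scoring UDF.
calls = []


def fake_score(path: str) -> Signals:
    calls.append(path)
    return {"score": float(len(path))}


old = {"a.jpg": {"score": 0.9}}
updated = delta_update(old, ["a.jpg", "b.jpg", "c.jpg"], fake_score)
print(sorted(updated))  # ['a.jpg', 'b.jpg', 'c.jpg']
print(calls)            # ['b.jpg', 'c.jpg']
```

The point of the design is that the scoring function is called only for the two new files, while the existing row for `a.jpg` is carried over unchanged.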
