"The Freiburg Galaxy project", Rolf Backofen, January 9, 2024

This presentation was part of the GHGA lecture series “Advances in Data-Driven Biomedicine”. GHGA stands for German Human Genome-Phenome Archive; you can find out more at www.ghga.de

I’m actually very happy to introduce Rolf, because we worked together for quite some time in the past in Freiburg, four to five years. I know your background a bit, but for the others I also wrote it down again.

So I’m happy to introduce Rolf. He studied computer science at the University of Erlangen, then moved for his PhD to Saarland University, and then worked for the German Research Center for Artificial Intelligence. Afterwards he did his habilitation in Munich, from where he moved to a chair for bioinformatics at the University of Jena. But the chair was actually not comfy enough, so you switched chairs.

Let’s put it this way: they wanted to give me a different university and a reduction in my salary, which was not really nice.

So you switched chairs, to Freiburg, again in computer science. Freiburg is also a nice city, and a better one for cycling, which is one of your hobbies. Rolf’s scientific interests are the structure prediction of proteins and RNAs, motif detection, also in RNAs, alternative splicing, and the investigation of regulatory sequences, so everything that is involved there. Broader topics like CRISPR-Cas9, ribosome profiling, and single-cell technologies are also in your repertoire, and, for this presentation, the big Galaxy project. With that I give you the floor, so that you can present it.

Thank you very much for your introduction. Now I need the slides. Oops. Okay, thanks for that. As already said, thank you very much for this
invitation. This is of course not my work alone, not at all; it is the work of the whole Galaxy team, which is headed by Björn Grüning, who is also sitting here. The Freiburg Galaxy project actually got quite big, and what I also want to take the opportunity to do here is to talk about the data life cycle and how Galaxy can help to get the data life cycle done, so FAIR research and research data management in an easier way, let’s say it this way. I think I have to click again. Ah, okay, thank you.

Okay, so this is the data life cycle. You probably know it. Before you start to analyze your data (and the analysis part is what Galaxy is known for, oops, that was too fast, but I will show that Galaxy can do a little bit more), you first have to plan your experiments. You then have to collect data, both experimental data that you acquire and public data that might be helpful. You must process this data, with quality control etc., analyze it, and then of course, and that is what RDM means at least if you talk to the university administration, preserve the data. Of course we all know it’s not only about that, because in the end the FAIR principles are there to share and reuse the data in a good way.

As I said before, Galaxy is a lot about analysis, but all the other parts of the data life cycle are also handled by Galaxy, and it can make them easier for you. But let’s go first to the analysis, and I want to
give a very practical example, because there you can see what the problems are. I stumbled over this paper for different purposes, not Galaxy-related ones. iTracer is a single-cell technology, and that’s actually an interesting idea: it’s a lineage recorder that combines reporter barcodes with an inducible CRISPR-Cas9 scarring system (I will explain on the next slide what it is really doing), and it’s compatible with single-cell and spatial transcriptomics. This is the Nature Methods publication for that.

What they do is the following. They introduce a vector, which has a guide RNA and a GFP, into stem cells or induced stem cells. The guide RNA, together with the CRISPR system, then induces scarring in the 3′ region of the GFP, and this can then be detected by sequencing. Once you have this vector, you have GFP, so you can select the cells that have the vector encoded. Once you have done that, you start to develop cerebral organoids, and you then induce the scarring at different time points. This allows you to follow up which cells developed into which other cells, which gives you the tree that nicely shows you: okay, this cell differentiated into these cells.

This is a nice technology, but if you look at it from the data point of view, it’s getting not so nice. This is the entry in ENA for iTracer: you have a project number, and then you have different experiments and different submissions there. Overall, these are 12 cerebral organoids comprising 66 experiments, and each of them has three different data sets associated, not to talk about how to handle the different libraries. So it’s complicated. If you want to collect all of this, it’s actually getting a mess, and currently you have to do it manually. And this is a problem: according to the university administration’s RDM, everything is there, you have the data, but in the end you cannot use it if you don’t know how you have to connect the pieces. It’s reconstructable, yes, but it’s not easily accessible.

This was already observed quite some time ago by Anton Nekrutenko and James Taylor, who at that time surveyed 50 papers that used a very simple thing, the standard approach at that time, BWA for mapping. They found that a large share of these papers provided neither the primary data nor the versions and parameters used, so you could not really reuse the work, and some of them did not even give the accession of the genomic reference.
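The items that the survey found missing can be pictured as a small provenance record. The following is only a sketch: the `AnalysisRecord` class, its field names, and all example values (tool version, parameters, accessions) are assumptions made for illustration. They are not shown in the talk and are not Galaxy's internal format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AnalysisRecord:
    """What a reader needs in order to repeat one mapping step."""
    tool: str = ""
    version: str = ""
    parameters: dict = field(default_factory=dict)
    reference_accession: str = ""
    primary_data_accessions: list = field(default_factory=list)


def missing_items(record: AnalysisRecord) -> list:
    """Return the names of all fields that were left empty."""
    return [name for name, value in asdict(record).items() if not value]


# Example values are placeholders, not taken from any surveyed paper.
complete = AnalysisRecord(
    tool="bwa mem",
    version="0.7.17",
    parameters={"threads": 4},
    reference_accession="EXAMPLE_REFERENCE_ACCESSION",
    primary_data_accessions=["EXAMPLE_RUN_ACCESSION"],
)
tool_only = AnalysisRecord(tool="bwa mem")

print(json.dumps(asdict(complete), indent=2))
print(missing_items(complete))   # nothing is missing here
print(missing_items(tool_only))  # everything except the tool name
```

A record like `tool_only` corresponds to a paper that names the mapper but nothing else: every other item has to be guessed by whoever tries to repeat the analysis.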

Without this information, you would have to consider a lot of different parameters and reference genomes in order to find out what exactly the result to reproduce was. Then came the Galaxy idea, saying: okay, we have to analyze high-throughput data in a reproducible and transparent way.

This is a workflow for structural variation. It’s accessible also for wet-lab people: you can see what the input is and what each tool is, you can also see the parameters, and everything is stored. You can store the versions of the programs and the parameters together, which makes it simple when you want to reuse the workflow afterwards, and labs can fully analyze the data themselves with a workflow. Of course you have to provide training for that. I know there is a danger, and we can talk about this kind of experience as well, that some people are overconfident with these workflows and use them in ways they should not. But we see the bigger problem in the large amount of data that is not correctly analyzed because people don’t have access to a good workflow. With Galaxy they have it, it is taken up more easily, and we invested a lot in training, which I think is the way to do it.

The European Galaxy server, which was formerly the Freiburg Galaxy server, is part of the de.NBI cloud, the European Open Science Cloud, and the ELIXIR communities.

But I will go into details about that. Okay, so just to mention: of course it’s not only a Freiburg project, not at all. It was started by Penn State and the whole community, and the different servers are called usegalaxy dot whatever. The project is thinking in terms of software, services, community, and scientific applications. On the software side, they work on improved scalability and performance, on external data connectivity, and on new admin and user tools. Services means training, GPUs and high-memory resources, or access to protected data. Concerning scientific applications, you have a new machine learning toolkit, and I find it important that you have easy access to this, and you have validated public workflows. And for the community, you have a lot of citations, where we can see what is working quite well, and for example a new video library. So there is a lot going on in a worldwide community.

So what is it? It’s basically, initially, a GUI for interactively running tools. There is a Tool Shed with thousands of tools ready to run, and terabytes of data with thousands of curated reference datasets. There is a fully featured workflow functionality and a graphical interface capable of handling more than a thousand samples. About Jupyter and RStudio I will talk a little bit later. There are extensive training tutorials, and you have the freedom to do your work either on public high-performance computing infrastructure, or on institutional clusters (it is easy to install), or in the cloud, or on your own laptop, or even on a Raspberry Pi if you want to do that.

And it has all the gimmicks you want to have, like the activity bar, the Tool Shed, shared tools and workflows, and performance and usage metrics. Even new data technologies, like spatial omics data, come into Galaxy very early
because there is simply a worldwide community that helps to integrate these kinds of things. You have training and everything.

Here are some statistics of the Freiburg Galaxy server, and these are already outdated. I think they are from April, was it April, something like that? Now we have at least more than 70,000. What are the current numbers? 85? Oh sorry, 85,000, so it’s always outdated. We now have 85,000 registered users and more than 3,000 tools. usegalaxy.eu has run at least 45 million jobs and 7,000 workflows. There are more than 170 reference genomes linked, 95 million datasets, and 1.1 million histories, so it’s a huge data resource. We have more than 3,600 active users per month in 109 countries, and of course we are well connected with ELIXIR: we have 1,300+ ELIXIR AAI users, and a lot of citations.

What I want to go into in a little more detail is the training part. We have more than 300 tutorials in what is called the Galaxy Training Network, GTN, which also provides Training Infrastructure as a Service. So you only need the people who do the hands-on teaching; the training material and the infrastructure are provided by the European Galaxy server. And this is working quite well, as you can see from this
catalog here. In the GTN you can look at many different scientific topics. It’s free to use for everyone, it’s also suitable for self-study, and we have seen people doing this. Anybody can contribute updates and new tutorials, and it’s based on FAIR and open science principles. When you look at the statistics, we already cover 37 topics with 363 tutorials. There are more than 300 contributors, and as you can see, the number of contributors is growing roughly linearly. As I said, the 37 topics range over very diverse areas, like image analysis using deep learning, assembly, climate, computational chemistry, ecology, epigenetics, and so on. So it is a really large training platform.

This helped us, for example, in the COVID situation, where we couldn’t do on-site, in-person training. We could use the GTN and do online training, and that worked quite well. I think this is one of the things that should be explored more: having these well-established tutorial hubs to do the training. It’s also easy to take a tutorial that has been provided and build your own in-house training from these materials.

Good. This growth also means that we cover very different types of topics. Initially, Galaxy was known for high-throughput sequencing analysis; that was where it started. But if you now look at what is in Galaxy, you have metagenomics, humanities, metabolomics, materials science, proteomics, ecology, plant science, climate, and so on: a lot of different areas and communities, which also provide training and which also agree on data and interoperability. This can be seen in the communities that share a common framework. Here is just an ever-growing list of different communities, such as RNA, eCLIP, metagenomics,
and HiCExplorer, where they share a common framework and agree on what good workflows are, hopefully.

Okay. What I also want to go into, and we discussed this in this room before, is how we can distribute the burden of the computational load, with now 85,000 registered users and a lot of things to be done. This is the Pulsar network. The usegalaxy community is expanding. The original one was usegalaxy.org at Penn State, and the second one, I think, was usegalaxy.eu, right? We were the second one, and we are still the second largest one, I guess. And then we have Australia, France, Belgium, Estonia, Spain, and Italy, so it is expanding. But then of course the question is how we deal with this computational load. We were lucky to get quite a substantial amount of funding to build the infrastructure for usegalaxy.eu, roughly 2 million euros per year for several years, but we still need help with the computational burden. That’s why we have this Pulsar network, where different computing centers across Europe share their remote computing power to support the usegalaxy.eu load. Of course the de.NBI cloud is a big part, but there are also sites in Belgium, Portugal, Spain, and Norway, and one of the big ones is the UK’s Diamond Light Source, which provides power to deal with the large load of analysis tasks that we are handling.

The next point is probably always boring, but it’s also important. Via de.NBI we were kind of, forced is not the right word, but we got the offer to do the certification, meaning we got the money and then we had to do it. It was an interesting experience. We are now ISO 27001 certified, and I think it’s good to do this at least once, because then you find out what the faults are and what to look for.

Now, when we talk about Galaxy, and I can see this in my own lab, there are always the people
who say: but I’m a bioinformatician, I don’t want to use this GUI, right? And that’s true: wet-lab users like the GUI-based interface, but bioinformaticians usually don’t. First of all, there are now unified APIs for bioinformaticians, so that they can control everything from the command line. I think that is one way to do it. The other way, which I think is the right way, is to use the interactive tools in Galaxy to combine the two worlds.

What are the worlds? One world is Galaxy, where you have hard-coded, prebuilt tools with a rigid interface. The other world is where you have complete freedom, dynamic outputs, and no rules, like in Jupyter and RStudio. My argument is always that when you start to develop a new tool, like a peak-calling tool, you want to do it on that side. But then you say: okay, I’m now using it for a very special purpose, like peak calling for PARIS-type data, and then you find out that you have a problem with multiple mapping, and you have to go back to the mapping. So you have to combine your tool with a standard workflow to make it easily accessible. The argument would be that for the standard workflows it’s better to use Galaxy, and to combine it with only the new parts that you want to do in, for example, Jupyter. First of all, you have a recording of everything, like the parameters and the versions you have used, and if there is a new version of the workflow you can just replace it; you don’t have to do it yourself. Updating all these uninteresting things is where you later spend most of your time. That is one thing. The other thing is my experience that when a tool is in Galaxy, it’s highly cited. When it’s done in your own workflow, even if you put it in Bioconda, it’s not cited as highly, because in Galaxy it’s so easily accessible for people.

Here’s an example of Galaxy interactive tools. You have a table here, and then you run your R script in RStudio. You have the output, and this is all saved in the history. You can use the output that is produced, and you do only the part that you want to deal with in RStudio; for the rest, like mapping and so on, or the standard workflows before and after, you can use Galaxy. I think that is something that should be exploited a little bit more.

As you can see, there are already a lot of trainings in different areas. For RStudio we have training materials such as RStudio in Galaxy, R basics in Galaxy, advanced R in Galaxy, and RNA-seq counts to viz in R. For climate change there are Galaxy interactive tools training materials, even for something like an ecosystem simulator, which is not my topic, so I don’t have anything to say about it, or something like microbial analysis and species distribution modeling. And the nice thing is of course, if you have something like
this in different areas, you can even combine information from different areas, and I think that is also nice and interesting.

Okay, this was about Galaxy analysis. There are a lot of domain-specific tools, as I said 3,000+, from different areas like genomics, cancer research, computational chemistry, imaging, metabolomics, microbiome, machine learning, plant biology, and proteomics, just to name a few. But the processing is of course also important, and you have plenty for that in Galaxy as well, like quality control, data cleaning, and annotation. You can import workflows from WorkflowHub or Dockstore, and you have metadata handling. Then you have import and export: access to public databases, customized data access, and LIMS integration. The latter worked well with our local sequencing facility: they have a Parkour system where they collect the metadata, and this was taken directly into the LIMS system that is integrated with Galaxy. For preserving, you can export artifacts to different formats, like BioCompute Object and RO-Crate (we will talk about that), and to different remote sources, like FTP and Dropbox. And then you have the possibility to share artifacts by sharing datasets, histories, and workflows. You have role-based access control (we will talk about that), you have account cleaning, which is also important if you want to actually use the data, and you can import artifacts.

So what does the role-based access model look like? Here is an example. You have a Galaxy data library with private data for group A, private data for group B, and public data, which does not count against your quota. Everything is backed up. As you can see here, this is with the Max Planck Institute, whose sequencing facility we worked with a lot in an SFB. We had a LIMS integration that made it easy to get all the metadata that was acquired during the sequencing request directly into Galaxy and have it available. And if you don’t have this LIMS integration, you can have just the sequencing facility and a direct input via FTP, Dropbox, or Nextcloud.

Next, submission tools. The problem
that surprisingly comes up when you work with scientists in different collaborations is that they want to submit the data to make it accessible again, but they struggle to do the actual submission to repositories like ENA, because it’s not that simple, at least for non-bioinformaticians. Here the Galaxy group developed a submission tool, and this was done in the SARS-CoV-2 submission initiative. It takes the raw sequencing files and a metadata template, you do quality control, so you remove human reads, you provide your metadata through interactive metadata input, and you provide your credentials.
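The preparation steps listed here can be sketched as a small check that runs before anything is uploaded. This is a minimal sketch under assumptions: the column names in `REQUIRED_COLUMNS`, the file-name suffixes, and the function `check_metadata` are hypothetical and do not reproduce the actual Galaxy submission tool or the ENA metadata template.

```python
import csv
import io

# Hypothetical template columns; a real template defines its own.
REQUIRED_COLUMNS = ("sample_alias", "collection_date", "forward_file", "reverse_file")
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")


def check_metadata(tsv_text: str) -> list:
    """Return human-readable problems found in a tab-separated sheet."""
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if not (row.get(column) or "").strip():
                problems.append(f"line {line_no}: missing {column}")
        for column in ("forward_file", "reverse_file"):
            name = (row.get(column) or "").strip()
            if name and not name.endswith(FASTQ_SUFFIXES):
                problems.append(f"line {line_no}: {column} is not a gzipped FASTQ file")
    return problems


# Invented example sheet: the second sample has two deliberate problems.
rows = [
    REQUIRED_COLUMNS,
    ("sample_1", "2021-03-01", "s1_R1.fastq.gz", "s1_R2.fastq.gz"),
    ("sample_2", "", "s2_R1.fastq.gz", "s2_R2.bam"),
]
sheet = "\n".join("\t".join(row) for row in rows)

for problem in check_metadata(sheet):
    print(problem)
```

Catching an empty field or a wrong file type at this stage is cheaper than having a submission rejected after the upload.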

And then you have an upload tool, which uploads the FASTQ files and the metadata to ENA, and then you get the accession. So even that is simplified using Galaxy, and you can use it not only for SARS-CoV-2 but of course for general purposes.

Now let’s talk about how to store artifacts. Probably many of you know this, but what you want to have is a FAIR digital object, which was introduced by Peter Wittenburg some time ago. There was the ELIXIR project CONVERGE, which Björn was also part of, which tried to standardize life science data management across Europe. They initiated one implementation of the digital object, called RO-Crate, which strives to cover the connections between the different types of data that you have. If you look at it, there are many objects that are research outcomes, and all of them are first-class citizens. When you talk to the university administration, they always think only of the data as the thing that needs research data management. However, you have not only data: you have papers, presentations, systems biology models, software, workflows, and protocols, and for all of them you have different repositories. Data goes to data repositories, papers go to PubMed, presentations to SlideShare, software to GitHub, workflows to WorkflowHub, and protocols to protocols.io. And that’s fair, because they each have their own metadata and their own repositories, so that’s quite okay. However, they are not independent. The data is connected to software; software is connected to workflows and might be connected to systems biology models; and then you have the published paper, which references data but also has a presentation to explain it and is related to protocols that are published. So you have to combine all of that.

If we go back to the iTracer example, just to give you an idea of how it really looks: you have the publication, called “Lineage recording in human cerebral organoids”, in Nature Methods. Then you have the sequence data in ENA, and you have to look into the paper to find the accession number, because from the title you don’t find it automatically, so that is also something you need in order to repeat this. You have the protocol for the iTracer preparation, which is stored on protocols.io, and then you have software, which is on GitHub. So it’s reality that you have these different things, and they are not all cross-referenced. If you look at the paper, you don’t find a link
to the software, not to mention the protocols, which came out later. So what you have to do is package all of this: you have to generate an integrated view over these fragmented resources using persistent identifiers and metadata. The RO-Crate package of course has its own metadata, which we will talk about in a minute. It can be registered and deposited, it can be unpacked and activated if needed, and it is based on schema.org annotations in JSON-LD. This is the associated publication, and Björn is part of that publication.

Here is the packaging. You have structured metadata, with an author, an organization, and a license, and then the research object content, like directories of data, image files, links to web resources, GitHub or paper repositories, and different archives. All of this is put into an archive file format and packaging system, so that you keep the interconnection between the different types of data.

Between RO-Crate and Galaxy there is an API where you can separate files and workflow metadata. You have different options: you can upload single artifacts, or opt for the workflow, or take all the files if you don’t want to make your own workflow, and you have different filters. In the other direction, you can export histories either to a link or to a remote file. Different archive types are supported, including gzip and ZIP archives, BagIt directories, and BCOs.

You may ask why this is important. If you look at this example, it seems to be easy, but if you look at what the problem was for the COVID analysis, the history is a little bit more complex, with a lot of different experiments, and this whole history has to be exported. This cannot be done by hand, so you need automatic tools for that. And as I said before, the European Galaxy server is managing three petabytes of data with 80 million datasets, which probably have to be taken care of.

Okay. The other thing is that of course there are other initiatives. The important part, and I have had discussions about this quite often, especially at universities, probably not so much here, but at ours, is that we had the problem that there was a handcrafted, local way of doing research data management. I don’t believe in local versions of research data management, because in the end what you need is standardization. It’s not that you want to store the data; you want to
share it, right? And sharing the data means that you have to comply with other standards. Let me just mention some: BioCompute Objects, GA4GH, and NFDI. We already discussed that all of this is supported. BioCompute is a community-driven initiative to build a framework for standardization. It was originally developed to satisfy FDA research and review needs, but it’s becoming a standard there, and there is support for BioCompute in Galaxy. GA4GH is the Global Alliance for Genomics and Health. Of course most of you know this, initially from Sanger, Broad, and Ontario. There is also the Galaxy Tool Registry Service (TRS) workflow search that you can apply here.

And NFDI, as we are running out of time: in NFDI we are participating in DataPLANT and in bioimaging. If you look at DataPLANT, you have exactly the same problem of how to export a research object to RO-Crate, and you have the same kind of problems when looking at the data life cycle, from the experiments via data repositories to reference knowledge and computations. Here we are doing on the one hand this import and export to RO-Crate, and on the other hand we have data hubs for labs and researchers.

Okay, let’s summarize. This was import and export of data. We talked a lot about workflows, about GA4GH, and so on. So we have a lot of means to import and export data, workflows, and research objects, and overall Galaxy is not only for analysis: it’s a whole framework for all parts of the data life cycle. We try to help with all of these things.

I think the conclusion is that research data management is a tedious task. No one likes to do it, at least if you look at the collaborators. It has to comprise many steps and artifacts, and they are all connected. It is very often forgotten that you have connected data and that you have to link all these things. International standards and community efforts are important. And I think Galaxy is for all types of users. It’s important to say that even bioinformaticians can profit from it, because they can hand over the standard part, have it standardized, and concentrate only on their interesting part, not on the boring part of keeping everything updated to the new version, which takes 80% of the time. It’s widely used, with more than 85,000 users, and it can make it much easier to comply with these standards in all aspects. It’s still not easy, it’s still something you have to do, but you can make it easier.
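The point that all artifacts are connected can be made concrete with a sketch of an RO-Crate-style metadata file. The layout follows the general shape of an RO-Crate 1.1 `ro-crate-metadata.json` (a JSON-LD `@graph` with a metadata descriptor and a root dataset). Every identifier under `https://example.org/` and the helper `unresolved_references` are placeholders assumed for illustration; the real accession, repository, and protocol links are not given in the talk, and a complete crate would carry more properties (for example a license and a publication date).

```python
import json

SPEC = "https://w3id.org/ro/crate/1.1"

crate = {
    "@context": SPEC + "/context",
    "@graph": [
        {   # Metadata descriptor: identifies this file and the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": SPEC},
            "about": {"@id": "./"},
        },
        {   # Root dataset: the one place that links all artifacts together.
            "@id": "./",
            "@type": "Dataset",
            "name": "Example lineage-recording study (placeholder)",
            "hasPart": [{"@id": "https://example.org/sequence-data"}],
            "citation": {"@id": "https://example.org/publication"},
            "mentions": [
                {"@id": "https://example.org/protocol"},
                {"@id": "https://example.org/software"},
            ],
        },
        {"@id": "https://example.org/sequence-data", "@type": "Dataset",
         "name": "Sequencing reads in a public archive"},
        {"@id": "https://example.org/publication", "@type": "ScholarlyArticle",
         "name": "Journal article describing the method"},
        {"@id": "https://example.org/protocol", "@type": "CreativeWork",
         "name": "Wet-lab protocol"},
        {"@id": "https://example.org/software", "@type": "SoftwareSourceCode",
         "name": "Analysis code repository"},
    ],
}


def unresolved_references(document: dict) -> set:
    """Return every referenced @id that has no entity in the graph."""
    known = {entity["@id"] for entity in document["@graph"]}
    referenced = set()

    def walk(value, is_entity=False):
        if isinstance(value, dict):
            # A dict holding only "@id" is a reference to another entity.
            if not is_entity and set(value) == {"@id"}:
                referenced.add(value["@id"])
            for child in value.values():
                walk(child)
        elif isinstance(value, list):
            for child in value:
                walk(child)

    for entity in document["@graph"]:
        walk(entity, is_entity=True)
    # The specification URL is an external profile, not a crate entity.
    return referenced - known - {SPEC}


print(json.dumps(crate, indent=2))
print(unresolved_references(crate))  # empty when every link resolves
```

The check at the end mirrors the problem described for the iTracer paper: a link from the publication to the software or the protocol either exists in the package or it does not, and a machine can tell the difference.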

And of course I want to thank the Freiburg Galaxy team, and especially Björn Grüning. He always likes to hide, but he’s here, yes. And of course, without the German Network for Bioinformatics Infrastructure we could not have done this; that was a big, big resource. Luckily we now have sustained funding for it from the Baden-Württemberg Ministry of Science together with the Federal Ministry of Education and Research, and I thank them and all of you. And if you want to get in touch, you are welcome. Thanks.
