HeFDI Data Talk: NFDI4Earth - 03.11.2023

The presentation slides of this talk are published on Zenodo under: https://zenodo.org/records/10074524

The NFDI4Earth started almost 2 years ago, is one of the largest NFDI consortia, and comprises all domains of the Earth System Sciences (ESS), e.g., the deep Earth, marine sciences, climatology, geography, geoinformatics, or planetology in general.

A core product through which the NFDI4Earth serves the community is the OneStop4All, a web page that addresses the needs of researchers on all levels. This web page is currently in the last stages of internal review and will go live in the first half of 2024.
The second core product comprises the ways in which all researchers can participate in the NFDI4Earth: Interest Groups formed on the initiative of ESS researchers, e.g., for high performance computing or research software; the Academy for early career scientists, which organises, e.g., networking meetings or hackathons; funding for topical research data management (RDM) projects as a pilot, an incubator, or an educational theme; and online ESS training material through its EduTrain. The connection of the NFDI to state initiatives such as HeFDI is vital for the increasingly important field of RDM.

Dr. Dominik Hezel (Institute for Geosciences, Goethe University Frankfurt) and PhD student Tamanna (Goethe University Frankfurt) will present how the NFDI4Earth works, provide an example from a pilot, and report on the progress of a joint project with HeFDI that uses Jupyter Notebooks. The latter are becoming a quasi-standard for code exchange. Ways to use Jupyter Notebooks even more quickly and conveniently via JupyterHubs are currently being investigated.

***
The HeFDI Data Talks (https://t1p.de/hefdi-data-talks-2024) are a bi-weekly, open information and discussion series on data management in science, at which relevant NFDI consortia and research data management services present themselves. The series discusses current topics and presents numerous tools and services, including local and regional ones. The HeFDI Data Talks are offered by the state initiative HeFDI, which is funded by the Hessian Ministry of Science and Research, Art and Culture (HMWK).

If you have any suggestions or feedback on the topics, please contact the HeFDI office (hefdi@uni-marburg.de). If you would like to be regularly informed about our offers and events, you are welcome to subscribe to our newsletter (https://t1p.de/bgkfl)!

So welcome, everyone. As I said, I'm presenting about the NFDI4Earth. The NFDI, I think, I don't have to explain in too much detail to this audience. The NFDI4Earth is one of the consortia of the National Research Data Infrastructure, which was established just a couple of years ago when the German government discovered it needed to do something about this digital stuff that's going on and is likely here to stay. The NFDI4Earth is one of, I think, 28 consortia now, which were funded in three rounds. The third round is now closed, so we are basically in full swing now.

So, just a very brief outline of this talk, because, as Maria said, we are two persons today. I will talk about the NFDI4Earth and give a bit of an overview of what it is. I will just very briefly touch on JupyterHubs, because this is part of our cooperation between the NFDI4Earth and HeFDI: with Tamanna we have a bit of a connection between our two initiatives here. I will give a research data management example from an NFDI pilot; I will explain what a pilot is, and then I will give you this example. Tamanna will then give a specific example, also from data science, and this will also emphasize why NFDI and HeFDI are such important and vital initiatives.

So the official start of the NFDI4Earth was on 1 October 2021, so almost two years ago, but the full start, with everyone on board and so on, was about January to March 2022, so we have now been going for about one and a half years. If I needed to describe the NFDI4Earth, or the NFDI in general, in the tiniest nutshell possible, I think it is about making data FAIR: findable, accessible, interoperable, and reusable. I'm pretty sure you've all heard about these principles, but this is really at the heart and core of the whole NFDI, and everything else orbits around these very important core principles. Now, the Earth system sciences
are a very broad field, and we are very proud within the NFDI4Earth that we were able to bring together all the various domains in the Earth sciences: for example the geosphere, the atmosphere, the biosphere, the hydrosphere, and so on, all the way up to also include meteorites, other planets, and so on. This is also why we are, I think, the second-largest consortium, so we are one of the largest. It was quite a challenge getting everyone under one roof, but we managed, and, as I said, we are quite proud of this.

The idea is now to go from local processes to global challenges, and here is where all the data come in, because all the researchers in the various Earth system science domains work on, maybe, soil science, climate, weather, marine sciences, the rocks beneath them, maybe the Earth's dynamo, or something like that. We have all these data of maybe one location or one region, and what we want to do is bring all these data together so that they can be accessed through one web page. By this we might be able to address global challenges: if there's a drought or something like that, then we have all the data available for this region to address the kind of problem that might be emerging there. We do this, of course, by observing, measuring, modeling, analyzing, and predicting the system, so we bring whatever is possible for accessing data from all the various domains into the NFDI here.

One often funny part is, of course, that we are the NFDI4Earth, and there's this "National" in there, and sometimes there's confusion: how can this be national, because we are an international community, an international science? Of course we are. The big thing about the NFDI and this national aspect is that we are not, as is very often the case in academia in Germany, restricted to states, but we try to do this on the national level. This is of course also where HeFDI and these kinds of state consortia come in as very
important, because they have this local integration, as Maria already said, with the researchers in universities and research institutions. The NFDI itself, the national branch if you want, is on the level where we try to bring everything together. The NFDI itself is then an element of the European Open Science Cloud, because every country in the EOSC has to bring something to the table, and Germany brings the NFDI to the table. So this is how this works.

Now, who are we? We are 58 institutions; I think we are now 60, actually, but we started with 58, and you can see they are from all over Germany. We are not restricted to universities or something like that: we have universities, research organizations, infrastructure providers, governmental institutions, scientific associations, and networks. We started this in 2018 as an open consortium, and, as I said, in 2021 or 2022 we got the funding, and in 2021 we started this entire NFDI4Earth. We are constantly growing and including additional stakeholders, as I said, from all the different data providers.

So how do we start in Earth system sciences? We have data from different sources, and these sources are quite heterogeneous. This is actually the biggest obstacle and the biggest challenge, and also the justification for why we are here as NFDI4Earth, because all these sources are so different, so heterogeneous. They have different levels of openness: some are completely closed, some are completely open, some may want some money or subscriptions, and so on. These different infrastructures very often have demanding models, and they maybe have data-intensive applications, so maybe we need to share some common data centers to calculate and operate these models. The entire idea of the NFDI4Earth is to have a positive attitude towards openness and FAIR, as I said before, and of course we want to be collaborative. We start with established data repositories, and
then we again have the challenge of, for example, bringing together several standards for interoperability into a single web interface, basically.

So we started by collecting data services, and we found about 150 data services, so again a large heterogeneity here, with all activities related to Earth system sciences. Only a few of these are sustainable. Very often these are small databases, maybe local to a university, maybe local to the computer of an individual researcher, but of very high value. We need to collect these and put them into something like, say, a knowledge hub that can then be accessed through a web interface, so that you really find all these important resources. We have a very high data heterogeneity in quality, curation levels, licenses, semantics, and so on, and we need to find ways to homogenize these or find mapping tools, so that if you want to access various databases, you can do this through one single tool and do not need to rely on a lot of different tools. There are of course different data cultures and velocities towards FAIR; this is also a challenge we need to address. There's incomplete support along the data life cycle, which is shown here on the right side. Very often, when it comes to publishing results, maybe in repositories, it's a bit difficult: there's the machine that produces the data, then there's the repository, but in between there is the researcher, and it is very often not necessarily the fun part of her or his work to also put these data into repositories. And then there's a lack of support for platform tools, and this is something we want to solve here in the NFDI4Earth.

So our key goal, coming from the issues just described, is that we want to have a one-community approach to sustainable, open, and FAIR research data management in Earth system sciences. We really try to get everyone on board, so this is not a top-down approach but a bottom-up approach we
have. For this we have a couple of tools for how researchers can participate, and I will show you these tools a little bit later, because I think these are among the most important instruments we have in the NFDI4Earth, on which we rely, and on which we also spend a large chunk of money. So this means community-driven, agile development of platforms for data integration, collaborative data analysis, and so on. We want to have qualification for people to produce data, tools, and services as a basis for FAIR research data management, and we want to facilitate all this through something we call the OneStop4All.

The OneStop4All is basically a web page through which you can access everything. Think of Google: maybe it's also just a web page, but it was a game changer and still is, because through search engines such as Google or DuckDuckGo or whatever we get access to what we want. This is basically what we aim for: one single web page through which you can access and reach everything you need in the domain of the Earth system sciences related to data and everything around data. I have to admit I had hoped a little that by November, because this was the initial plan, we would have released this OneStop4All, so I could really show it to you. Unfortunately we are a little bit late, so we will have a soft release later this year and a hard release early next year. Otherwise it would of course have been even nicer to really demonstrate this OneStop4All. I have seen some pre-release and it's really very nice, I can assure you of that.

One of the things is that with this OneStop4All we also have a user support network, so that it's not all just digital, not all just web pages you are directed to: if there's something you really don't get an answer to through the OneStop4All, there's a user support network behind it, and you are directed to a human person who then at least tries to answer your question. With this approach we have here,
we also hope to maybe be a key driver of the build-up and operation of the entire NFDI, or to bring our part into it. Of course other consortia certainly also have some great ideas to incorporate; these are the ones we have here.

So from an individual user's perspective: a user, say a researcher or a student, goes to the OneStop4All, and hopefully she gets her answer there already via one of our products; I will come to these products. Otherwise she's directed to the user support network, and this is then contributed by participants of the NFDI4Earth. This is something we are currently building up.

I'm briefly going into this concept here, not too much, because these charts are always a bit hard to memorize. This is again the OneStop4All here in the center with the user support network, and then we have something like a Living Handbook. This is actually what we do here in Frankfurt; the Living Handbook is our responsibility. It collects articles about, first of all, the outcomes of the NFDI4Earth itself. But also, for example, maybe you need a data management plan for your next proposal and you don't know what a data management plan is: there will hopefully at some point also be an article about the data management plan in the Living Handbook, so you can go there and find this resource on how to write a data management plan for the Earth system sciences for your next proposal. Another one is the Knowledge Hub. As I said before, there are lots of databases, small and large ones, and the Knowledge Hub collects all these available databases, so you can then really find and access them, through collaborative analysis tools and so on.

Now, one of the really core products, and I want to go into this a little bit, is this part, because this is how everyone can really participate and provide important input to the NFDI4Earth and
participate in it, and I will show you this in a little more detail.

So we have pilots. Pilots are one-year projects, so you can apply at the NFDI4Earth for such a one-year project. This is one full-time equivalent; it is about 75,000 that you get to work on one research data management question. Of course this should have a relation to the NFDI4Earth. We had, for example, something about data cubes, something about metadata, vocabularies, ontologies, some model integration, and I will show you one example which was part of one of the pilots we actually did and what came out of it, just to give you some idea of what is meant here. The second round has just started, and the next call will be live in 2024. In the first call we had 14 pilots, in the second one I think it was seven. Well, it all depends on the funding we have as well; let's see how many we have next year.

Another one is incubators. This is similar to a pilot, just shorter: you can apply for three to six months, also as a full-time equivalent, so this is then maybe something between, I don't know, 15,000 and 30,000 that you can apply for. And this is more high-risk: if you have an idea and really want to try it out, you can apply for an incubator, and then you get three or four months or something like that to try out your idea, see how it works, and, depending on this, maybe apply for a pilot or for a DFG proposal, something like that. The second round also just started, so the next call will be in 2024.

Then you have education and training services. This is a platform for sharing educational material, and here again, I'm not going through all of this, we have some pilots. This is basically twofold. One part is that people can upload or provide their learning material, maybe videos, maybe scripts, whatever, through a training portal of the NFDI4Earth. And if you want to build something, say you want to make a couple
of videos and you need some support, you can apply for an educational pilot here and you also get funding. This is a continuous application, so you can apply for this anytime. There's also a contact here, but all the contacts are of course on the NFDI4Earth website.

Then we have the Academy, and the Academy is for young researchers: it's an early-career training network, a peer-mentoring environment, an open academy program. We have the first-year cohort, and currently the application for the second cohort is open until 30 November, in case someone is interested. These are, I think, 35 young researchers at the moment from all over Germany. They come together, and these meetings are funded, to exchange about data science in general, present their projects, hold a hackathon, something like that. So this is also really a great instrument.

Finally, we have something called Interest Groups. These are the currently operating Interest Groups, on research software, machine learning, long-term storage and archiving, and a couple more. If some researchers have a certain contribution they want to make to the NFDI4Earth, they can form such an Interest Group, where they maybe define a standard or produce a white paper about something, and this is then a means of doing this. At the moment this has no funding, because we thought it's not entirely necessary, but if some funding were required for, I don't know, maybe a meeting, this might also be a possibility.

So these are the means of participating in the NFDI4Earth, and, as I said, I think this is really one of the great aspects and instruments we have at work here.

So our strategy looks as follows. Please don't look at this in detail; I just want to point out that we are here, in the Commons. This should be my name; it is an old slide, and I forgot to change this here. So I'm responsible
for this NFDI Commons theme here, and, as I said before, we are responsible for the Living Handbook. As we are responsible for this, I thought I could quickly show you three or four slides about how this works and what we are doing.

So what is the Living Handbook? It is essentially text-based information in articles, but of course these can include media like images, maybe a table, a video, or something like that. We are even ambitious enough that it may include some Jupyter notebooks in the future, but this will not be an option when we start with this in the next couple of weeks. The structure is that of a wiki-like encyclopedia: anyone can submit an article, the article can be changed and updated, and someone else can add something to the article. But it can also be a white paper, where there is no change after publication. Of course it's possible to have collections: if, again, you have an article on data management plans in Earth system sciences, it might be in a collection on data management plans, where there's another article about how to write a data management plan, how to upload a data management plan, or whatever, but it could also be one of the articles in another collection that is only about proposal writing or something like that.

This is partially peer reviewed, so if necessary an article will be peer reviewed: if it's a white paper, or if it's really an article about some specific, I don't know, vocabulary or something like that. So if someone suggests a vocabulary for rocks, then this would likely be peer reviewed, to make sure that what is submitted here is not nonsense. Everything is curated by an editorial board, and again, as it is a community approach, if you are interested in joining the editorial board, that is possible; that's what it says here.

So it's community-driven content. An initial set of articles is currently being prepared; we have about 100 articles right now. Many of these are, for the moment, outcomes of the NFDI4Earth, for example
the outcomes of the pilots. So if you're interested in what these pilots are and what they have done, you can find these articles in the Living Handbook, and likewise for the incubators or from the educational hub. And here, you could submit your article directly; of course, once the website is live, this will be a button that you can click and not this cumbersome email address here. Or you can just ask me.

Now, this was about the NFDI4Earth, and next I am just touching on JupyterHubs, because we initially thought about working on a JupyterHub, as I said, in conjunction with HeFDI. Now this changed a little bit, as this is a fast-paced area. It was always thought that the NFDI should have some basic services, something like an authentication service, something like eduroam. It was not clear until recently how this would be realized, and then the idea was to have one consortium that is a little bit different from the others and that provides these basic services. This consortium is called Base4NFDI, and it is now also funded, and services can now apply to be incorporated into the Base4NFDI basic services. One of the services currently under review, as far as I know, is JupyterHubs, and there are some big players from Germany who participate there. So we thought it does not make a lot of sense to also try to set up a JupyterHub if we now have this as a basic service; the need sort of disappeared.

So we replaced this a little bit with some new developments, which is making web applications. With this I'm coming to the last part of my part of the presentation, which is the outcome of a pilot, and I would like to show you one such web application that we built and that has a direct connection to research data management. I'm going through this rather briefly. So this is a web application, and I'm now live in the browser. So think of you
having a lab and producing data, and these data come out of the machine in a not very nice format, but you want to provide them in a nicer format. What you can do then is store them, for example, in an electronic lab notebook. Four weeks ago there was a very nice presentation by Michael Selzer about Kadi4Mat, and we actually use Kadi4Mat here as the electronic lab notebook. So the data come out of the machine and go into this electronic lab notebook. I'm not showing this here; Michael showed this two talks ago. But now I'm switching to Tamanna. Thank you, and please, Tamanna, go on with your part.

Thank you, Dominik. All right, good morning everyone. I'm Tamanna, and the topic of my talk today is the automated classification of rocks using machine learning. To begin with, I would like to give a brief introduction to our field of study. Geology is the scientific study of Earth's structure, composition, and history. It plays an important role in understanding our planet's past and predicting its future. Igneous rocks are one of the three primary rock types found on Earth, and they originate from the solidification of molten rock, which is known as magma. They are of two types, intrusive and extrusive. Intrusive igneous rocks form beneath the Earth's surface with slow cooling, which results in coarse-grained textures, as we can see here, for example, in diorite and granite, whereas extrusive igneous rocks form on the surface with rapid cooling, resulting in fine-grained or glassy textures, as in basalt and andesite. We generally classify them on the basis of major element oxides, for example silica, aluminium oxide, iron oxide, etc.

This is a classical method for rock classification, and this plot is known as the TAS plot, the total alkali-silica plot. It is the fundamental and most widely applied tool in petrology and geology, and it is used to classify extrusive rocks based on their composition. As you can see, there are a lot of rock types, and it
is also instrumental in understanding the differentiation of magma. On the x-axis we have silica, and on the y-axis we have the sum of the two alkali oxides, sodium and potassium. Take basalt, for example: for basalt the silica value ranges from 45 to 52 weight percent, and the sum of sodium and potassium oxides ranges up to between 4 and 6 weight percent.

So what is the aim of this study? As we know, the volume of geochemical data has increased exponentially over the last few decades. Machine learning techniques can be used efficiently to handle and process large data sets and make it possible to derive valuable insights from extensive archives of historical data. So we aim to discern previously unrecognized correlations in the already existing databases within the geological evolution of rocks, and also to look for correlations or groups within the same field, for example the basalt field that I mentioned on the last slide, and thus expand the scope of knowledge in our domain. And as we know, machine learning is able to provide more comprehensive data analysis and, to an extent, better results.

A lot of work has been done in geology using machine learning, and these are the various areas where machine learning has been applied. First, lithology classification: the identification of lithologies, source of material, special geological events, mineralization zones, etc. is the basis of the Earth sciences. Using image data as input, a model was able to recognize different igneous, sedimentary, and metamorphic rocks. Second, we have mineral classification: Makvandi in 2016 was able to create a model to identify magnetite in hydrothermal and metamorphosed volcanogenic massive sulfide deposits. A lot of work has also been done using image data to classify various minerals as seen under the microscope. Then there is recognizing ore deposits: researchers these days are also exploring
feasible machine learning methods for digital mapping. Using neural networks, one can predict hydrothermal gold and silver deposits; this had an accuracy of more than 70% and has been tested in an area of Iran. Then anomaly detection: a hybrid machine learning model with k-nearest-neighbor regression and random forest was able to classify zinc and lead anomalies, and also the grades of these anomalies, which were further correlated with mining activities and drilling data. And last, using the databases, classification tree models were developed to distinguish basalts from different tectonic settings using neural networks.

There are two types of learning in machine learning, unsupervised and supervised, and I will take you through the major differences between them. First is the labeling of the data. In unsupervised learning the data are not labeled: for example, as you see here, I have input the major element oxides, but I do not mention the name of the rock these data correspond to. In supervised learning I have the same major element oxides, but this time I also mention the rock type, or rock name, each data point corresponds to. The goal of unsupervised learning is to discover hidden patterns, whereas supervised learning is used to learn a mapping or relationship between input features and their associated target values and then make predictions. The task types for unsupervised learning are mostly clustering, which is grouping data points with similar features, and dimensionality reduction, which is reducing the number of features, whereas the task types for supervised learning are mostly classification and regression, which is predicting continuous values. Since the data are unlabeled, no human guidance is required for unsupervised learning, whereas we have to label the data in the
supervised learning, so human guidance for training the model is important in supervised learning. For all the above reasons, evaluation in unsupervised learning is more challenging and subjective, whereas for a supervised learning model the accuracy can be checked.

So these are some preliminary results of my model. The plot presented here represents the output of a variational autoencoder, which is a generative model; they are used to learn features on the basis of the probability distribution of the data. In this visualization we observe the presence of 10 distinct clusters when examining the two probability distributions I reduced my data to. The clusters indicate that the model has successfully learned and separated different patterns or groups within the data itself. And these are the results of my supervised learning. I again input 10 different types of rock data here, but with labels, and this time the model was able to distinguish them with an accuracy of 92%, which means that out of 1,000 samples that were predicted, 915 were classified correctly.

So where did the data for this machine learning model come from? We gathered data from the online portals GEOROC, OZCHEM, PetDB, and OnePetrology. GEOROC is a comprehensive resource for geological data, and the available data are collected from a wide range of sources. The databases can be downloaded on the basis of rocks, locations, which include various tectonic settings, and also minerals and inclusions. PetDB is the petrological database of the ocean floor. It offers a wide range of information on the composition and characteristics of rocks, minerals, and fluids. As you can see, the data can be downloaded on the basis of location, features, sample type, chemistry, etc. OZCHEM is an initiative by Geoscience Australia that provides access to geochemical data across the Australian continent. It is a comprehensive repository of geological and geochemical
information. OnePetrology is one of the nodes of Deep-time Digital Earth, and it is responsible for the construction of a database of magmatic rocks. On the website, data can be chosen from various options like igneous rocks, zircon data, isotopic data, as well as data specifically from the China region.

For our analysis we also created a random, fake data set within a Jupyter notebook using Python. The primary purpose behind generating fake data was to assess and validate the functionality and precision of the machine learning model that was developed. Additionally, this endeavor sought to evaluate the model's performance when exposed to novel data, while also assessing the effectiveness of the model in accurately grouping the data. So these are the results of the machine learning on the fake data, and this is the same TAS plot as we saw earlier. The green dots indicate the correct classifications. The model was trained on the data from the databases and tested on the fake data. Here we also see that the accuracy increased from 92% to 94% when tested on fake data. When we look at the misclassifications of the rocks, we generally see that they are either on these boundaries or at the intersections, meaning intersections of four different types of rocks, and we also see major misclassifications between basanite and tephrite. That the accuracy increased from 92% to 94% means that there were some misclassifications in the original data as well.

As for data access: while gathering the data, there were some challenges that we had to face. The data were available on different portals; a lot of data was available, actually. For A, which is accessible: some of the portals have an API while others do not, so it was pretty difficult to get data from each and every portal. And for I, which is interoperable: the portals all lack a common vocabulary, so
if one of the portals follows a certain format, another has a completely different format. So when I collected all the data and then had to use them for my machine learning model, I had to create an entirely new database out of all the databases I downloaded. The lack of uniform formatting across all the data portals posed significant challenges, particularly in terms of interoperability and adhering to the FAIR principles. This hindered the ability to make data accessible as well as interoperable across different machine learning models, and it affected the efficiency of data retrieval. So now initiatives by HeFDI and NFDI4Earth are there to centralize these databases, so that they can be found in one portal, which will further help in using all these data for various machine learning models in the future as well. Thank you for your attention.

Thank you very much to both of you.
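
The harmonization step described in the talk, building one table out of exports from several portals that lack a common vocabulary, could look roughly like the sketch below. It is an illustration only and not the speakers' code: the portal names, the column aliases, and the sample values are hypothetical, and real portal exports will differ. The only figures taken from the talk are the TAS axes (silica on the x-axis, the sum of sodium and potassium oxides on the y-axis) and the 45 to 52 weight percent silica range quoted for basalt.

```python
import csv
import io

# Hypothetical aliases: each portal is assumed to label the same quantity differently.
COLUMN_ALIASES = {
    "SiO2": ["SiO2", "SIO2(WT%)", "sio2_wt_pct"],
    "Na2O": ["Na2O", "NA2O(WT%)", "na2o_wt_pct"],
    "K2O": ["K2O", "K2O(WT%)", "k2o_wt_pct"],
    "rock_name": ["rock_name", "ROCK NAME", "lithology"],
}


def build_lookup(aliases):
    """Map every alias (case-insensitive) to its common column name."""
    return {
        alias.strip().lower(): common
        for common, names in aliases.items()
        for alias in names
    }


def harmonize(csv_text, source, lookup):
    """Parse one portal export and return rows keyed by the common names."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {"source": source}
        for column, value in raw.items():
            common = lookup.get((column or "").strip().lower())
            if common is None or not value:
                continue  # unmapped column or empty cell
            if common == "rock_name":
                row[common] = value.strip().lower()
            else:
                row[common] = float(value)  # oxide content in weight percent
        rows.append(row)
    return rows


def total_alkali(row):
    """Na2O + K2O, the y-axis of the TAS plot described in the talk."""
    return row["Na2O"] + row["K2O"]


def in_basalt_silica_range(row):
    """SiO2 between 45 and 52 wt%, the basalt range quoted in the talk."""
    return 45.0 <= row["SiO2"] <= 52.0


# Made-up exports from two hypothetical portals with different headers.
PORTAL_A = "SIO2(WT%),NA2O(WT%),K2O(WT%),ROCK NAME\n49.5,2.8,0.6,BASALT\n"
PORTAL_B = "lithology,sio2_wt_pct,na2o_wt_pct,k2o_wt_pct\nAndesite,58.0,3.5,1.5\n"

LOOKUP = build_lookup(COLUMN_ALIASES)
merged = harmonize(PORTAL_A, "portal_a", LOOKUP) + harmonize(PORTAL_B, "portal_b", LOOKUP)

for row in merged:
    print(row["source"], row["rock_name"], total_alkali(row), in_basalt_silica_range(row))
```

Keeping the alias table separate from the parsing code means that adding another portal only requires new alias entries. A shared vocabulary across portals, as the talk argues, would make such a table unnecessary.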
