Meta-analysis of bacterial mock communities reveals status of FAIR principles and impact of protocol biases on microbiome sequencing results – Luise Rauer – MICROBIOME – ISMB/ECCB 2023
Thank you. Okay, hi, it’s great to see so many people here. I hope you didn’t stay just because you have a seat in this room now; instead, I’m assuming that you’re interested either in FAIR principles or in protocol biases in microbiome sequencing data.
And that means that we share at least one research interest. So hi, I’m Luise Rauer, a doctoral researcher at the University of Augsburg and the Technical University of Munich in Germany, and I’m presenting our meta-analysis of bacterial mock communities. Whenever we do microbiome sequencing, our data are distorted by multiple biases that we introduce during sample processing: during sampling, during DNA extraction, during amplification and sequencing, and even our bioinformatic and statistical choices affect the results that we get in the end. So our observed composition of the microbiome almost never represents what we expected in the beginning.
How do I know the expected composition? Well, I’m working with mock communities. Mock communities are samples with a known composition and known species properties, so we know exactly what we put in, and then we can also measure what we get out. There are cell mock communities, where we have just the bacteria, and DNA mock communities, where we have the DNA of the bacteria. The good thing is that many people use these mock communities as positive controls, for example in clinical microbiome studies, and I’m interested in these protocol biases.
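As a rough illustration of comparing what goes into a mock community with what comes out, the sketch below computes a per-genus log2 ratio of observed to expected relative abundance. It is not the analysis pipeline from the talk: the function names, the pseudocount, and the example numbers are all hypothetical.

```python
import math


def relative_abundance(counts):
    """Convert non-negative counts per genus into fractions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {genus: value / total for genus, value in counts.items()}


def log2_bias(observed_counts, expected_fractions, pseudocount=0.5):
    """Return log2(observed / expected) for each genus in the expected composition.

    Only genera listed in the expected composition are used, so unexpected
    genera (e.g. contaminants) are ignored and the observed fractions are
    renormalised over the expected genera. With pseudocount=0, a genus that
    was not observed at all raises a ValueError from math.log2.
    """
    genera = [g for g, frac in expected_fractions.items() if frac > 0]
    observed = relative_abundance(
        {g: observed_counts.get(g, 0) + pseudocount for g in genera}
    )
    expected = relative_abundance({g: expected_fractions[g] for g in genera})
    return {g: math.log2(observed[g] / expected[g]) for g in genera}


# Hypothetical even mock community and hypothetical read counts.
expected = {"Bacillus": 0.25, "Escherichia": 0.25, "Listeria": 0.25, "Pseudomonas": 0.25}
observed = {"Bacillus": 100, "Escherichia": 400, "Listeria": 100, "Pseudomonas": 400}

bias = log2_bias(observed, expected, pseudocount=0)
for genus, value in sorted(bias.items()):
    print(f"{genus}: {value:+.2f}")
```

A negative value means the genus is under-represented relative to the known input, a positive value means it is over-represented. The log ratio is one common choice for compositional data; other bias measures would work in the same place.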
So I thought: hey, if people are using these positive controls, let’s just exploit the vast variety of protocols that is already out there and do a meta-analysis of these mock communities. So this is what we did. First, we identified commercial bacterial mock communities; there are around 34 from three different companies.
Then we searched Google Scholar for publications using these mock communities, and then we applied inclusion and exclusion criteria. This is the full scheme, and I would like to highlight two things here. First, we focused on Illumina 16S sequencing, which was simply the most widely used method, so we were able to keep 79% of our studies. In contrast, the major exclusion criterion was that the raw data was not available, so here we were only able to keep around 30% of our studies. I would like to show you this in
a little more detail. Here we have the number of studies that we included in our analysis and the year when they were published, so you can see that microbiome publications are nicely going up. Now, what I expected for the publications not depositing their data is that the number goes down, right? Journals are increasingly enforcing the rules that you should deposit your data. But it’s actually going up in the same way. Sometimes you find “available upon request”, which in principle is fine, but there were also cases where the publication stated where to find the raw data and then there is actually no raw data there. Here’s one example from 2018. I mean, we are all busy as scientists, but I don’t think that these researchers are still working on uploading their data. And for the last
exclusion criterion, I would like to play a little game with you, which I call “spot the mock”. Here I have all the raw data that was uploaded in the study, and the only information that we have is “incorporation of a mock community during sequencing”. Who knows what could be the mock sample? In this case it’s quite hard. I think in this case it was the PC, the positive control, which makes sense, but there are also a lot of other abbreviations here and there’s no additional information. And this was actually one of the easier cases, so this is a study that we included in the end. This last category again has the same rate; it’s also going up. So we have a significant proportion of articles that either do not deposit their raw data or where the raw data is just not clearly labeled.
Okay, so now we have applied our inclusion and exclusion criteria. Then we downloaded the FASTQ files and processed them jointly to generate a huge ASV table, which just contains the samples, the genera, and the count data. From the publications we extracted the sample metadata, so how these samples were actually generated. This brings us to the problem of how to squeeze the tons of different pipelines that are available for generating microbiome data into one scheme. Luckily, there’s already some work done; maybe you are aware of the STORMS checklist or of the MIxS specifications. Building on these publications, we collected around 100 variables of sample metadata from the publications, and then we can analyze how frequently this metadata is reported. Here on the left side we have just the publications that use the cell mock
community, so it includes all the steps that you apply to a cell mock community, and on the right side we have all the steps that come after DNA extraction, which you would also apply to a DNA mock community. You can see that, for example, the extraction kit or the hypervariable region, the really key information of any microbiome project, are given in 100% of the cases. But for other information, for example on the PCR conditions, like how many PCR cycles were used, the reporting already goes down drastically to two-thirds of the studies. When it comes to the second cycle of PCR, only one-third of the studies report the details, so there’s some information lacking here. And this is what the data looks like in the end. We have four different mock communities shown in this figure, and I think the most important thing
that you can see is that there’s quite some variability, first within a mock community, just from the different protocols, but even within a study there’s quite large variability, which is basically just random variability. But you can also quite easily distinguish the mock communities from each other, which is a good message for us, right? It means that if you have totally different samples in your microbiome experiment and you process them with totally different pipelines, you still get totally separated samples. That’s a good sign. But we were not interested in finding differences between mock communities but
in finding differences between the protocols. So we corrected for the expected composition, focused only on the shared genera, and removed some outliers, and now the mock communities are nicely mixed and we can start studying some biases. Unfortunately, or fortunately, we see some differences based on the hypervariable region, and I’m sure that there would be more differences, for example based on the primers that were used in the studies. But in this analysis we are really limited, on the one hand by the information that is given in the papers, as you have seen before, and on the other hand we are
limited by the fact that some of these studies are so specific that we cannot really say whether that’s just a study-specific effect or really because of the methodology. However, when we look at the cell mocks, it looks a little better. Here we can clearly see that we have huge effects based on the extraction kits, and we can explain up to 93% of the variation for some of the genera just based on the extraction kit that was used. If you’re interested in extraction bias, I’m also here with another poster, which
I present tonight; it’s number 235. We can bioinformatically correct extraction bias, so I’m happy to discuss this as well if you’re interested. But to summarize this project: first, we found that data availability is really lacking in terms of the data that has been deposited, and there’s no improvement over time at the moment. Second, critical sample metadata is missing, metadata that we need for this kind of study and that would also be needed for full laboratory reproducibility. But in general we have this powerful approach where we can take the mock communities and measure all these biases. We are limited at the moment in the step where we extract the sample metadata from the publications, so if anyone has an idea how to automatically extract this content from the publications, I would be happy to hear about that.
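Once sample metadata has been extracted from the publications, the reporting rates mentioned earlier in the talk (for example, how often the number of PCR cycles is stated) come down to a simple count per field. The sketch below shows one way to compute them; the field names and the three example studies are hypothetical and do not come from the actual dataset.

```python
def reporting_rates(studies, fields):
    """Return, for each metadata field, the fraction of studies reporting it.

    A field counts as reported when the study has a value for it that is
    neither None nor an empty string.
    """
    if not studies:
        raise ValueError("at least one study is required")
    rates = {}
    for field in fields:
        reported = sum(1 for study in studies if study.get(field) not in (None, ""))
        rates[field] = reported / len(studies)
    return rates


# Hypothetical metadata records, one dictionary per publication.
studies = [
    {"extraction_kit": "Kit A", "hypervariable_region": "V4", "pcr_cycles": 30},
    {"extraction_kit": "Kit B", "hypervariable_region": "V3-V4", "pcr_cycles": None},
    {"extraction_kit": "Kit A", "hypervariable_region": "V4", "pcr_cycles": 25},
]
fields = ["extraction_kit", "hypervariable_region", "pcr_cycles"]

rates = reporting_rates(studies, fields)
for field in fields:
    print(f"{field}: {rates[field]:.0%}")
```

The hard part described in the talk, getting these records out of free-text methods sections, is not addressed here; the sketch only covers the counting step that follows.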
And to give you a few more general conclusions: I would be happy if you used mock communities as positive controls in your research; this would also help you to see whether your results really hold up. I would like to say that I’m really not here to judge anyone. I found mistakes like this in my own research, where I was not adhering to these standards. But if you can, please try to adhere to the reporting standards and try to go FAIR. With that, I would like to thank my co-authors, and I would like to thank you for your attention. I’m happy to take questions.