Notebooks¶
Jupyter python notebooks¶
This folder contains various iterations of python notebooks for different occasions. We can recommend the colab notebooks:
read mzQC notebook example¶
Welcome to the 5-Minute mzQC interactive guides with python!¶
This python notebook will guide you through your first steps with reading someone's mzQC in python.
First we will explore how to open and read a mzQC file, explore its' content and dataframe integration, and finally visualise a metric for interactive view in colab! (We assume it is not the first time you get your feet wet with python, otherwise rather plan for 25 minutes.)
#@title This will install the latest version of pymzqc
%pip install pymzqc --quiet
Then, we load this right into our python session by loading pymzqc (from mzqc). We'll also utilise some other libraries, too.
For example, we will use requests to load some data from the web.
from mzqc import MZQCFile as qc
import pandas as pd
from io import StringIO
import requests
import plotly.express as px
#@title Upload `.mzQC` file here!
from google.colab import files
try:
uploaded = files.upload()
except:
print("If that does not work, proceed with option 2.")
# maybe an alternative?
# from google.colab import drive
# drive.mount('/content/drive')
If that does not work, proceed with option 2.
And load it as JsonSerialisable from file:
#@title Provide a file object by `open` either a colab uploaded file or a file path if you are using this notebook locally.
with open(uploaded['filename.mzQC'], "r") as file:
some_mzqc = qc.JsonSerialisable.FromJson(file)
#@title Loading files from the web works only a little different.
response = requests.get('https://github.com/HUPO-PSI/mzQC/raw/main/doc/examples/metabo-batches.mzQC')
some_mzqc = qc.JsonSerialisable.FromJson(response.text)
Inspect data¶
We can now go ahead and take a look what that file we loaded has on offer:
print(some_mzqc.description)
This dataset is based on the analysis of polar extracts from a nucleotype-plasmotype combination study of Arabidopsis for 58 different genotypes. For details of the used plant material we refer to Flood (2015). Analysis of the polar, derivatized metabolites by GC-ToF-MS (Agilent 6890 GC coupled to a Leco Pegasus III MS) and processing of the data were done as described in Villafort Carvalho et al. (2015). Here, the number of metabolites (75) is much lower than in the other two data sets, partly because the focus was on the primary rather than the secondary metabolites. The number of samples was 240, with a percentage of non-detects of 16 %; the maximum fraction of non-detects in individual metabolites is 92 %. All metabolites were retained in the analysis. Four batches of 31-89 samples were employed, containing 2-6 QCs per batch, 14 in total.
pymzqc deserialised JSON arrays (i.e. the runQualities and their qualityMetrics) can be used like python lists:
for m in some_mzqc.runQualities[0].qualityMetrics:
print(m.name)
Detected Compounds
You can traverse the hierarchy with standard python member access notation (.) and get to the bottom of things (like a metric name or value).
some_mzqc.runQualities[0].qualityMetrics[0].value
57
We can also get the table metrics directly as pandas dataframe! 🤯
from google.colab import data_table
data_table.enable_dataframe_formatter()
df = pd.DataFrame(some_mzqc.setQualities[0].qualityMetrics[0].value)
df
| Run name | PCA Dimension 1 | PCA Dimension 2 | PCA Dimension 3 | PCA Dimension 4 | PCA Dimension 5 | Injection sequence number | Batch label | |
|---|---|---|---|---|---|---|---|---|
| 0 | GCMS ToF sample 10 | -3.348963 | -2.341435 | -1.486755 | -0.276620 | -2.683632 | 13 | 4 |
| 1 | GCMS ToF sample 100 | 0.419126 | 2.055220 | -0.396590 | 1.780880 | -2.020238 | 16 | 7 |
| 2 | GCMS ToF sample 101 | 6.824155 | 1.514235 | 1.163668 | 0.173623 | -3.088806 | 17 | 7 |
| 3 | GCMS ToF sample 102 | -1.080886 | 2.994551 | 0.421837 | 0.013869 | -3.161503 | 18 | 7 |
| 4 | GCMS ToF sample 103 | -1.707045 | 6.070161 | 0.627950 | -0.628732 | -1.064512 | 19 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 221 | GCMS ToF sample 95 | 4.246444 | 1.841355 | 0.670941 | 0.938284 | -1.520605 | 11 | 7 |
| 222 | GCMS ToF sample 96 | 2.368072 | 1.001514 | 1.235160 | -0.091535 | -3.000758 | 12 | 7 |
| 223 | GCMS ToF sample 97 | 1.674060 | 0.450211 | -0.512383 | 2.100324 | -1.654505 | 13 | 7 |
| 224 | GCMS ToF sample 98 | -2.928426 | 4.175587 | -0.449582 | 0.333712 | -0.814875 | 14 | 7 |
| 225 | GCMS ToF sample 99 | 10.000164 | 1.918147 | -1.348393 | 0.275859 | 0.749887 | 15 | 7 |
226 rows × 8 columns
Visualising our data¶
It is always good to have a look your data visualised. The former metric provides us with all we need to plot the first two PCA dimensions and add interactive labels.
df['Batch name'] = df['Batch label'].map(str)
fig = px.scatter(df, x="PCA Dimension 1", y="PCA Dimension 2", color="Batch name",
hover_name="Run name", hover_data=["Injection sequence number", "Batch label"])
fig.show()
write mzQC notebook example¶
Welcome to the 5-Minute mzQC interactive guides with python!¶
This python notebook will guide you through your first steps with writing your own mzQC in python.
We will calculate a QC metric from our data, visualise it, put it in a controlled vocabulary object, add some metadata about the ms-experiment the data came from, and finally create our first mzQC file! (We assume it is not the first time you get your feet wet with python, otherwise rather plan for 25 minutes.)
#@title This will install the latest version of pymzqc
%pip install pymzqc --quiet
Then, we start right in by loading pymzqc (from mzqc). We'll also utilise some other libraries, too.
For example, we will use requests to load some data from the web. This we will use as the computational data basis for the 'QC' metric value we are calculating and later visualise in this python notebook.
from mzqc import MZQCFile as qc
import numpy as np
from io import StringIO
import requests
import matplotlib.pyplot as plt
# Get some input data from the interwebs
url = "https://raw.githubusercontent.com/percolator/percolator/master/data/percolator/tab/percolatorTab"
r = requests.get(url)
if r.ok:
data = r.content.decode('utf8')
data = '\n'.join(data.split('\n')[2:]) # remove the nasty extra header row
The data is identification data from a yeast digest 2h gradient run, just before you would subject your findings to percolator. We will take a closer look at the mass discrepancies of the found PSMs.
dm_source = np.loadtxt(StringIO(data), usecols=(2,3,4), delimiter='\t', dtype={'names': ('ScanNr','ExpMass', 'CalcMass'), 'formats': ('i4', 'f4', 'f4')})
dm = dm_source['CalcMass'] - dm_source['ExpMass']
Visualising our data¶
It is good practice to have a look at the data with some basic visualisation. Next, we will plot the mass errors in a histogram.
plt.hist(dm, bins=10)
plt.gca().set(title='Delta Mass Histogram', ylabel='Frequency')
[Text(0, 0.5, 'Frequency'), Text(0.5, 1.0, 'Delta Mass Histogram')]
The distribution seems centered around 0, which is good but the histogram has a rather wide bin setting, so we can check with another visualistation. Let's take a look at the violin plot of the absolute error.
fig,ax = plt.subplots(1)
ax.violinplot(np.abs(dm))
# plt.violinplot(dm)
ax.set_ylabel('Mass error')
ax.set_xlabel('Frequency')
# Turn off x tick labels
ax.set_xticklabels([])
[]
This looks suspicious, we better make some QC metrics to document that! We will describe the found distribution with $\sigma,Q_1,Q_2,Q_3$ and put it in a mzQC file to share with others.
Creating our metrics¶
qm1 = qc.QualityMetric(accession="QC:0000000", name="Mass error sigma", value=np.std(dm))
q1,q2,q3 = np.quantile(dm, [0.25,0.5,0.75])
qm2 = qc.QualityMetric(accession="QC:0000000", name="Mass error Q1, Q2, Q3", value=(q1,q2,q3))
print("Mass error Q1={}, Q2={}, Q3={}, sigma={}".format(qm2.value[0],qm2.value[1],qm2.value[2],qm1.value))
Mass error Q1=-0.0098876953125, Q2=0.0, Q3=0.00299072265625, sigma=0.008799662813544273
Creating a controlled vocabulary defined metric in mzQC is easy! Look for the fitting entry in our ontology or request a new one. Note the name and the accession, then set them as the QualityMetrics name and accession. We also need to keep a reference from which ontology this metric came from, so we note this as 'QC'. We will provide details about the ontologies used later.
Setting the actual value of a metric is easy, just collect the values you calculated. (They have to be appropriate to the definition of the metric, though.)
Keep track of your sources¶
infi_peak = qc.InputFile(name="file.raw",location="file:///dev/null/file.raw",
fileFormat=qc.CvParameter(accession="MS:1000563", name="Thermo RAW format"),
fileProperties=[qc.CvParameter(accession="MS:1000747",
name="completion time",
value="2015-10-03-T10:18:27Z"),
qc.CvParameter(accession="MS:1000569",
name="SHA-1",
value="76de62feccaaaadb608e89d897db57135e39ad87"
),
qc.CvParameter(accession="MS:1000031",
name="instrument model",
value="LTQ Orbitrap Velos"
)
])
It is crucial to provide some more information apart from the metrics themselves. To put the metrics into context, it is essential to keep track of the peak file or acquisition file that constitutes the basis to later data. In our case it is a thermo raw file from which we did the identifications from which we calculated our metric. We need to keep some more information about the origins file apart from the name:
- the completion time
- the checksum
- the instrument type
infi_ids = qc.InputFile(name="afteridbeforepercolat.tsv",location="file:///dev/null",
fileFormat=qc.CvParameter(accession="MS:1000914",
name="tab delimited text format"),
)
We also want at least to know about the software that was involved in the process of creating the metrics. In our case that is the software that produced the identifications - comet, which has a PSI-MS cv term we can use, and obviously this python notebook.
anso_mzqc = qc.AnalysisSoftware(accession="MS:1002251",
name="Comet",
version="2018.01.1",
uri="http://proteomicsresource.washington.edu/sequest.php")
anso_nb = qc.AnalysisSoftware(version="0.1.2.3", uri="file:///mylocal/jupyter/host")
Assemble a mzQC file (object)¶
Now we have all the neccessary information in place to build a descriptive object of the experiment and analysis for the mzQC file:
The metadata
- inputFiles
- analysisSoftware
and our qualityMetrics
- qm1
- qm2
meta = qc.MetaDataParameters(inputFiles=[infi_peak, infi_ids],analysisSoftware=[anso_mzqc,anso_nb])
rq = qc.RunQuality(metadata=meta, qualityMetrics=[qm1, qm2])
Now it is time to provide some more information about the ontologies which' terms we were using, so that the next one to read our file can look up the metric definitions, understand what the values mean, and use the data in further analysis.
cv_qc = qc.ControlledVocabulary(name="Proteomics Standards Initiative Quality Control Ontology",version="0.1.0", uri="https://github.com/HUPO-PSI/qcML-development/blob/master/cv/v0_1_0/qc-cv.obo")
cv_ms = qc.ControlledVocabulary(name="Proteomics Standards Initiative Mass Spectrometry Ontology",version="4.1.7", uri="https://github.com/HUPO-PSI/psi-ms-CV/blob/master/psi-ms.obo")
Finally we can put the last pieces together and form a mzQC file. We also note the current time in ISO 8601 format, the version of the mzQC format, we were writing. And so we made our first mzQC file.
from datetime import datetime
mzqc = qc.MzQcFile(version="1.0.0", creationDate=datetime.now().isoformat(), runQualities=[rq], setQualities=[], controlledVocabularies=[cv_qc, cv_ms])
With some extra python notebook magic, we can even download the fruit of our labour. (You need to execute all cells successfully in order to get the download link.)
from IPython.display import HTML
import base64
def download_mzqc(mzqc, title, filename):
inmem_file = qc.JsonSerialisable.ToJson(mzqc)
mzqc_b64 = base64.b64encode(inmem_file.encode())
payload = mzqc_b64.decode()
html = '<a download="{filename}" href="data:application/json;base64,{payload}" target="_blank">{title}</a>'
html = html.format(payload=payload,title=title,filename=filename)
return HTML(html)
download_mzqc(mzqc, "our first mzqc file", "first.mzQC")