Curate entity identifiers#

To make data queryable by an entity identifier, the identifiers must comply with a chosen standard. Bionty enables this by curating data against versioned ontologies using curate().

We’ll demonstrate this by curating genes first and then cell markers (CellMarker), where not all values can be mapped immediately.

Let’s start by importing the required modules from Bionty and Pandas.

from bionty import Gene, CellMarker, lookup
import pandas as pd

Curating genes#

To illustrate this, let us generate a DataFrame that stores a number of gene identifiers, some of which are corrupted.

data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "hgnc id": ["HGNC:24086", "HGNC:5", "HGNC:1101", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "corrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")
df_orig
                 gene symbol  hgnc id
ensembl_gene_id
ENSG00000148584  A1CF         HGNC:24086
ENSG00000121410  A1BG         HGNC:5
ENSG00000188389  FANCD1       HGNC:1101
corrupted        corrupted    corrupted

We require a reference identifier (specified as the reference_id parameter of curate). The available identifiers can be looked up using lookup. For genes, examples are “ensembl_gene_id”, which corresponds to Ensembl gene IDs (e.g. ‘ENSG00000148584’), or “symbol”, which corresponds to gene symbols (e.g. ‘A1CF’).

lookup.gene_id
feature(description='description', ncbi_gene_id='ncbi_gene_id', ensembl_protein_id='ensembl_protein_id', omim_id='omim_id', gene_type='gene_type', symbol='symbol', ensembl_transcript_id='ensembl_transcript_id', synonyms='synonyms', hgnc_id='hgnc_id', mgi_id='mgi_id', ensembl_gene_id='ensembl_gene_id')

To curate the DataFrame into a queryable form, we index it by a default identifier, which for genes is ensembl_gene_id. If no column name is provided, curate() curates the index.

First we create a Gene() instance using the default source database and version.

gene = Gene()

Next, we can check whether our values are mappable against the ontology.

gene.inspect(df_orig.index, reference_id=gene.ensembl_gene_id)
✅ 3 terms (75.0%) are mapped.
🔶 1 terms (25.0%) are not mapped.
{'mapped': ['ENSG00000148584', 'ENSG00000121410', 'ENSG00000188389'],
 'not_mapped': ['corrupted']}
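
Conceptually, an inspection like this is a membership check of each identifier against the reference table. The sketch below illustrates that idea in plain Python. It is not Bionty's implementation: inspect_ids and reference_ids are hypothetical names introduced only for this illustration, and the reference set is a hand-written excerpt.

```python
from typing import Dict, Iterable, List, Set


def inspect_ids(values: Iterable[str], reference_ids: Set[str]) -> Dict[str, List[str]]:
    """Split values into those found in the reference set and those not found."""
    result: Dict[str, List[str]] = {"mapped": [], "not_mapped": []}
    for value in values:
        key = "mapped" if value in reference_ids else "not_mapped"
        result[key].append(value)
    return result


# Hypothetical reference excerpt; a real ontology table holds many more IDs.
reference_ids = {"ENSG00000148584", "ENSG00000121410", "ENSG00000188389"}
index_values = ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389", "corrupted"]

report = inspect_ids(index_values, reference_ids)

# The percentage in the log is simply mapped / total.
pct_mapped = 100 * len(report["mapped"]) / len(index_values)
print(f"{len(report['mapped'])} terms ({pct_mapped:.1f}%) are mapped.")
print(report)
```

With 3 of 4 identifiers present in the reference, the mapped share is 3 / 4 = 75.0%, matching the log above.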

We have identified 3 terms that are mappable against the ontology. Let’s curate them by mapping them against the ontology. Bionty uses the index unless a column is specified.

gene.curate(df_orig)
✅ 3 terms (75.0%) are mapped.
🔶 1 terms (25.0%) are not mapped.
                 gene symbol  hgnc id     orig_index       __curated__
ensembl_gene_id
ENSG00000148584  A1CF         HGNC:24086  ENSG00000148584  True
ENSG00000121410  A1BG         HGNC:5      ENSG00000121410  True
ENSG00000188389  FANCD1       HGNC:1101   ENSG00000188389  True
corrupted        corrupted    corrupted   corrupted        False

The curated DataFrame is now indexed by the curated gene identifiers. A new column orig_index contains the original index, and a new boolean column __curated__ indicates whether each row could be curated successfully.
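
The __curated__ flag makes it straightforward to separate rows that mapped from rows that need manual review. The following is a minimal pandas sketch; the DataFrame is a hand-built stand-in for the curated output shown above (not produced by Bionty here), and the variable names are chosen for illustration.

```python
import pandas as pd

ids = ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389", "corrupted"]

# Hand-built stand-in for the curated DataFrame shown above.
curated = pd.DataFrame(
    {
        "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
        "orig_index": ids,
        "__curated__": [True, True, True, False],
    },
    index=pd.Index(ids, name="ensembl_gene_id"),
)

# Boolean indexing splits mapped rows from rows that need a closer look.
mapped_rows = curated[curated["__curated__"]]
unmapped_rows = curated[~curated["__curated__"]]

print(unmapped_rows["orig_index"].tolist())
```

Keeping orig_index alongside the flag means the original labels of the unmapped rows remain available for follow-up.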

The same procedure is available for gene symbols. First, we inspect which symbols are mappable against the ontology.

gene.inspect(df_orig["gene symbol"], reference_id=gene.symbol)
🔶 The identifiers contain synonyms!
💡 To increase mappability, convert them into standardized names/symbols using '.map_synonyms()'
✅ 2 terms (50.0%) are mapped.
🔶 2 terms (50.0%) are not mapped.
{'mapped': ['A1CF', 'A1BG'], 'not_mapped': ['FANCD1', 'corrupted']}

Two of the four gene symbols are mappable. Bionty also warns us that some of our symbols are synonyms that can be mapped to standardized symbols.

mapped_symbol_synonyms = gene.map_synonyms(
    df_orig["gene symbol"], reference_id=gene.symbol
)
mapped_symbol_synonyms
['A1CF', 'A1BG', 'BRCA2', 'corrupted']
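
The effect of synonym mapping can be pictured as a dictionary lookup from alias to standardized symbol. This is only a sketch of the idea: alias_to_symbol is a hypothetical, hand-written table with the single alias seen in this example, whereas Bionty draws its synonyms from the ontology source.

```python
# Hypothetical alias table containing only the alias from this example.
alias_to_symbol = {"FANCD1": "BRCA2"}

symbols = ["A1CF", "A1BG", "FANCD1", "corrupted"]

# Replace known aliases and leave every other value unchanged.
standardized = [alias_to_symbol.get(symbol, symbol) for symbol in symbols]
print(standardized)
```

Values without a known alias, including the corrupted one, pass through untouched, which is why they still show up as unmapped afterwards.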

We can store them in our DataFrame for further use.

df_orig["non-synonymous gene symbol"] = mapped_symbol_synonyms

You may provide a column name to curate a specific column against a reference identifier. When mapping symbols, the function automatically converts aliases into standardized symbols. In this example, FANCD1 is converted into BRCA2.

gene.curate(df_orig, column="gene symbol", reference_id=lookup.gene_id.symbol)
✅ 3 terms (75.0%) are mapped.
🔶 1 terms (25.0%) are not mapped.
           gene symbol  hgnc id     non-synonymous gene symbol  ensembl_gene_id  __curated__
symbol
13736.0    A1CF         HGNC:24086  A1CF                        ENSG00000148584  True
24881.0    A1BG         HGNC:5      A1BG                        ENSG00000121410  True
17731.0    FANCD1       HGNC:1101   BRCA2                       ENSG00000188389  True
corrupted  corrupted    corrupted   corrupted                   corrupted        False

This is equivalent to:

gene.curate(df_orig, column="gene symbol", reference_id=gene.symbol)
✅ 3 terms (75.0%) are mapped.
🔶 1 terms (25.0%) are not mapped.
           gene symbol  hgnc id     non-synonymous gene symbol  ensembl_gene_id  __curated__
symbol
13736.0    A1CF         HGNC:24086  A1CF                        ENSG00000148584  True
24881.0    A1BG         HGNC:5      A1BG                        ENSG00000121410  True
17731.0    FANCD1       HGNC:1101   BRCA2                       ENSG00000188389  True
corrupted  corrupted    corrupted   corrupted                   corrupted        False

Of course this also works with other columns such as “hgnc id”.

gene.curate(df_orig, column="hgnc id", reference_id=lookup.gene_id.hgnc_id)
✅ 3 terms (75.0%) are mapped.
🔶 1 terms (25.0%) are not mapped.
           gene symbol  hgnc id     non-synonymous gene symbol  ensembl_gene_id  __curated__
hgnc_id
13736.0    A1CF         HGNC:24086  A1CF                        ENSG00000148584  True
24881.0    A1BG         HGNC:5      A1BG                        ENSG00000121410  True
17731.0    FANCD1       HGNC:1101   BRCA2                       ENSG00000188389  True
corrupted  corrupted    corrupted   corrupted                   corrupted        False

Match (unmappable) cell markers to the reference#

Depending on how the data was collected and which terminology was used, it is not always possible to curate all values. Some values might follow a different standard or simply be corrupted.

This section demonstrates how to look up unmatched terms and curate them using the CellMarker entity. First, we create an example Pandas DataFrame containing a few valid and invalid cell markers (antibody targets) and features (Time) from a flow cytometry dataset.

markers = pd.DataFrame(
    index=[
        "KI67",
        "CCR7x",
        "CD14",
        "CD8",
        "CD45RA",
        "CD4",
        "CD3",
        "CD127",
        "PD1",
        "Invalid-1",
        "Invalid-2",
        "CD66b",
        "Siglec8",
        "Time",
    ]
)

Let’s instantiate the CellMarker ontology with the default database and version.

cell_marker = CellMarker()

First, we can have a look at the cell marker table that we just loaded.

df = cell_marker.df()
df.head()
   id        name   ncbi_gene_id  gene_symbol  gene_name                                     uniprotkb_id  synonyms
0  CM_MERTK  MERTK  10461         MERTK        MER proto-oncogene, tyrosine kinase           Q12866        None
1  CM_CD16   CD16   2215          FCGR3A       Fc fragment of IgG receptor IIIb              O75015        None
2  CM_CD206  CD206  4360          MRC1         mannose receptor C-type 1                     P22897        None
3  CM_CRIg   CRIg   11326         VSIG4        V-set and immunoglobulin domain containing 4  Q9Y279        None
4  CM_CD163  CD163  9332          CD163        CD163 molecule                                Q86VB7        None

Now let’s check which cell markers from our DataFrame can be found in the reference. We do this using the .curate method:

cell_marker.curate(markers)
✅ 10 terms (71.4%) are mapped.
🔶 4 terms (28.6%) are not mapped.
           orig_index  __curated__
Ki67       KI67        True
CCR7x      CCR7x       False
CD14       CD14        True
CD8        CD8         True
CD45RA     CD45RA      True
CD4        CD4         True
CD3        CD3         True
CD127      CD127       True
PD-1       PD1         True
Invalid-1  Invalid-1   False
Invalid-2  Invalid-2   False
CD66b      CD66b       True
SIGLEC8    Siglec8     True
Time       Time        False

The log shows that 4 terms were not found in the reference.

Among them, Time, Invalid-1 and Invalid-2 are non-marker channels, which CellMarker will not curate.

Note that certain markers are converted from synonyms to their standardized names, such as PD1 -> PD-1.

CCR7x is not found either. Let’s check the lookup with auto-completion:

cell_marker_lookup = cell_marker.lookup()
[Image: auto-completion in the lookup for CCR7, https://d33wubrfki0l68.cloudfront.net/eee08aab484a13dbaefc78633d1805ee61cd933c/8d864/_images/lookup_ccr7.png]
cell_marker_lookup.CCR7
cell_marker(index=163, id='CM_CCR7', name='CCR7', ncbi_gene_id='1236', gene_symbol='CCR7', gene_name='C-C motif chemokine receptor 7', uniprotkb_id='P32248', synonyms=None)

Indeed, the marker should be CCR7; CCR7x was a typo.
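
When there are many unmapped terms, browsing the lookup by hand can be complemented by fuzzy string matching. The sketch below uses difflib from the Python standard library rather than any Bionty feature. The list reference_names is a hypothetical, hand-written excerpt of marker names, and the 0.8 cutoff is an arbitrary choice for this illustration.

```python
from difflib import get_close_matches

# Hypothetical excerpt of reference marker names.
reference_names = ["CCR7", "CD14", "CD8", "CD45RA", "CD4", "CD3", "CD127", "PD-1"]
unmapped = ["CCR7x", "Invalid-1", "Time"]

# For each unmapped term, list reference names that are close in spelling.
suggestions = {
    term: get_close_matches(term, reference_names, n=3, cutoff=0.8)
    for term in unmapped
}
print(suggestions)
```

Here only CCR7x receives a suggestion (CCR7), while the non-marker channels get none. Suggestions of this kind are candidates to verify against the reference, not automatic corrections.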

Now let’s fix the markers so all of them can be linked:

Tip

Using .lookup() instead of typing the name as a string helps avoid typos!

curated_df = markers.rename(index={"CCR7x": cell_marker_lookup.CCR7.name})

OK, now we can run curate again, and all cell markers are linked!

cell_marker.curate(curated_df)
✅ 11 terms (78.6%) are mapped.
🔶 3 terms (21.4%) are not mapped.
           orig_index  __curated__
Ki67       KI67        True
CCR7       CCR7        True
CD14       CD14        True
CD8        CD8         True
CD45RA     CD45RA      True
CD4        CD4         True
CD3        CD3         True
CD127      CD127       True
PD-1       PD1         True
Invalid-1  Invalid-1   False
Invalid-2  Invalid-2   False
CD66b      CD66b       True
SIGLEC8    Siglec8     True
Time       Time        False
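
To see which labels were changed during curation, the new index can be compared with orig_index. The following pandas sketch uses a hand-built excerpt of the curated table above; curated_markers and renamed are illustrative names, not part of Bionty.

```python
import pandas as pd

# Hand-built excerpt of the curated marker table shown above.
curated_markers = pd.DataFrame(
    {
        "orig_index": ["KI67", "CD14", "PD1", "Siglec8", "Time"],
        "__curated__": [True, True, True, True, False],
    },
    index=["Ki67", "CD14", "PD-1", "SIGLEC8", "Time"],
)

# Compare the standardized index with the original labels element by element.
name_changed = (
    curated_markers.index.to_numpy() != curated_markers["orig_index"].to_numpy()
)

# Keep only mapped rows whose label was actually rewritten.
changed = curated_markers[curated_markers["__curated__"] & name_changed]
renamed = dict(zip(changed["orig_index"], changed.index))
print(renamed)
```

A mapping like this is a compact record of how original channel names relate to standardized marker names.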