Learning a functional grammar of protein domains using natural language word embedding techniques
Abstract
In this paper, using word2vec, a widely used natural language processing method, we demonstrate that protein domains may have a learnable implicit semantic "meaning" in the context of their functional contributions to the multi-domain proteins in which they are found. Word2vec is a group of models that can be used to produce semantically meaningful embeddings of words or tokens in a fixed-dimension vector space. In this work, we treat multi-domain proteins as "sentences" in which domain identifiers are tokens that may be considered "words". Using all InterPro [1] Pfam domain assignments, we observe that the embedding could be used to suggest putative GO assignments for Pfam [2] Domains of Unknown Function.
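As a rough illustration of the "proteins as sentences" framing described in the abstract, the sketch below turns ordered domain architectures into token sequences and enumerates the (target, context) pairs that a skip-gram style word2vec model would be trained on. It is a minimal sketch under stated assumptions, not the authors' pipeline: the protein names, the toy architectures, the window size, and the helper names `domain_sentences` and `skipgram_pairs` are all hypothetical, and no embedding is actually trained here.

```python
from collections import Counter

# Toy domain architectures (hypothetical, NOT taken from InterPro):
# each protein is an ordered list of Pfam-style domain accessions.
proteins = {
    "protA": ["PF00018", "PF00017", "PF07714"],
    "protB": ["PF00017", "PF07714"],
    "protC": ["PF00069"],  # single-domain protein, carries no context
}


def domain_sentences(architectures):
    """Keep only multi-domain proteins; each one becomes a 'sentence'."""
    return [domains for domains in architectures.values() if len(domains) > 1]


def skipgram_pairs(sentences, window=2):
    """Enumerate (target, context) pairs within a symmetric window."""
    pairs = []
    for sentence in sentences:
        for i, target in enumerate(sentence):
            start = max(0, i - window)
            stop = min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    pairs.append((target, sentence[j]))
    return pairs


sentences = domain_sentences(proteins)
pairs = skipgram_pairs(sentences, window=2)
pair_counts = Counter(pairs)  # how often each domain pair co-occurs
```

In a full workflow, sequences like `sentences` would be passed to a word2vec implementation to learn one vector per domain identifier; domains that co-occur in similar architectures would then tend to lie close together in the embedding space.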
Journal details
Volume 88
Issue number 4
Pages 616-624
Full text links
Publisher website (DOI) 10.1002/prot.25842
Europe PubMed Central 31703152
PubMed 31703152