Why would a 2019-nCoV protein sequence in the NCBI database match a protein submitted in 2018?


There seems to be a bit of a conspiracy theory brewing over some data in the NCBI database, and I don't have the necessary knowledge to make sense of it.

It basically goes like this:

  1. Go to NCBI BLAST
  2. Click on the big Protein BLAST button
  3. Enter AVP78033 in the main search box and click BLAST
  4. Click on the first result that shows a 100% match and click "See 5 more title(s)" in the first entry

This shows that the search is a complete match for a Bat SARS-like coronavirus protein from a 2018 research paper, for Wuhan seafood market pneumonia virus (which the NCBI site indicates is an alias for 2019-nCoV), and for Bat coronavirus from 29 Jan 2020.

My question is - why would a protein from Bat SARS-like coronavirus and 2019-nCoV be showing up as a perfect match for one another? Does this mean that 2019-nCoV might actually be a previously-discovered coronavirus that very recently started infecting humans? Or could it be that a recently collected sample from Wuhan was mis-identified as 2019-nCoV when it is actually the same coronavirus from the 2018 submission?

Clicking around the links on that site seems to bring up dozens of similar but different pages that I don't have the knowledge to distinguish, but the Accession column from the search results described above contains a link to this page, which says that it is a provisional RefSeq and acknowledges that it is identical to the bat coronavirus:

PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence is identical to QHD43418. Annotation was added using homology to SARSr-CoV NC_004718.3.

Can somebody who actually understands these things please make sense of this?

2019-nCoV is a virus that originated in bats (at least this is the current hypothesis). It shows 96% sequence similarity to the BatCoV RaTG13 sequence (see reference 1), which supports this origin.

It is still 87.99% identical to the "Bat SARS-like coronavirus", which explains the hit you found and is not unexpected, as these viruses are very closely related (see reference 2).
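To make "percent identity" concrete, here is a minimal sketch of how such a number can be computed from two already-aligned sequences. The function name and the toy strings are assumptions for illustration only; they are not database records, and BLAST's own identity figure is computed over its reported alignment, so the exact conventions may differ.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of ungapped aligned columns where the two residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned fragments (hypothetical, 20 residues each).
frag_a = "MYSFVSEETGTLIVNSVLLF"
frag_b = "MYSFVSEETGTLIVNSVLLF"  # identical to frag_a
frag_c = "MYSFVSEEAGTLIVNSVLLF"  # one substitution (T -> A)

print(percent_identity(frag_a, frag_b))  # 100.0
print(percent_identity(frag_a, frag_c))  # 95.0
```

A short protein can therefore be a 100% match even when the full genomes differ at many positions.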

These viruses are closely related, so I wouldn't expect too many differences at all. The envelope proteins can be critical for the function/structure of the virus, so mutations there might occur less frequently. And if they do occur, I would expect only a few changes over time, so with this little time gone, probably no mutation is seen yet. Additionally, due to the redundancy caused by codon degeneracy, not every mutation in the genomic material translates into a change in the protein.
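The codon degeneracy point can be illustrated with a small sketch. The table below is only the part of the standard genetic code needed for this example, and the nucleotide strings are made up for illustration; they are not taken from any coronavirus genome.

```python
# Partial standard genetic code: only the codons used in this example.
CODON_TABLE = {
    "GTT": "V",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "GAA": "E",
    "GAT": "D",
}


def translate(dna: str) -> str:
    """Translate a coding sequence whose length is a multiple of three."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


original = "GTTCTTGAA"    # V L E
synonymous = "GTTCTCGAA"  # CTT -> CTC: still leucine, protein unchanged
missense = "GTTCTTGAT"    # GAA -> GAT: glutamate becomes aspartate

print(translate(original))    # VLE
print(translate(synonymous))  # VLE
print(translate(missense))    # VLD
```

Two genomes can thus differ at the nucleotide level while encoding exactly the same protein.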


  1. Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event
  2. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding


I-TASSER (Iterative Threading ASSEmbly Refinement) is a hierarchical approach to protein structure prediction and structure-based function annotation. It first identifies structural templates from the PDB using the multiple-threading approach LOMETS, and then constructs full-length atomic models by iterative template-based fragment assembly simulations. Functional insights into the target are then derived by re-threading the 3D models through the protein function database BioLiP. I-TASSER (as 'Zhang-Server') was ranked as the No 1 server for protein structure prediction in the community-wide CASP7, CASP8, CASP9, CASP10, CASP11, CASP12, CASP13, and CASP14 experiments. It was also ranked the best for function prediction in CASP9. The server is in active development with the goal of providing the most accurate protein structure and function predictions using state-of-the-art algorithms. Please report problems and questions at the I-TASSER message board, and our developers will study and answer the questions accordingly.



Novel coronavirus complete genome from the Wuhan outbreak now available in GenBank

The complete annotated genome sequence of the novel coronavirus associated with the outbreak of pneumonia in Wuhan, China is now available from GenBank for free and easy access by the global biomedical community. Figure 1 shows the relationship of the Wuhan virus to selected coronaviruses.

Figure 1. Phylogenetic tree showing the relationship of Wuhan-Hu-1 (circled in red) to selected coronaviruses. Nucleotide alignment was done with MUSCLE 3.8. The phylogenetic tree was estimated with MrBayes 3.2.6 with parameters for GTR+g+i. The scale bar indicates estimated substitutions per site, and all branch support values are 99.3% or higher.

According to the CDC, as of January 11, Chinese health authorities say they’ve identified more than 40 human infections as part of this outbreak that was first reported on December 31. The World Health Organization announced the preliminary identification of the novel coronavirus on January 9. The GenBank record of Wuhan-Hu-1 includes sequence data, annotation and metadata from this virus isolated approximately two weeks ago from a patient believed to have contracted the disease in a Hubei province seafood market.

Rapid access to sequence data from public databases such as GenBank plays a vital role in helping countries develop specific diagnostic kits for disease outbreaks like this one.

Data Description

Genome Organization of Four Coronaviruses

All selected coronaviruses have similar genome organization with coding genes of spike (S protein), envelope (E protein), membrane (M protein), nucleoprotein (N protein), and several open reading frames. SARS-CoV, 2019-nCoV, MERS-CoV, and RaTG13-CoV express 9, 8, 10, and 9 non-redundant protein coding genes, respectively (Figure 1A). In SARS-CoV, orf3b is overlapped with orf3a and E gene, orf7b is overlapped with orf7a, orf8b is overlapped with orf8a, and orf9b is part of orf9a (N gene). In 2019-nCoV, only orf7b is overlapped with orf7a and other genes are separated. In MERS-CoV, the orf4b is overlapped with orf4a and orf8b is part of N gene. In RaTG13-CoV, ns7b and ns7a are overlapped.

Figure 1 Genome organization of four coronaviruses and the characterization of predicted B/T-cell epitopes. (A) Genome organization of SARS-CoV, 2019-nCoV, MERS-CoV, and RaTG13-CoV. (B) The distribution of the predicted B/T epitopes of E, M, N, and S across four coronaviruses. (C) The relationship between protein length and the number of predicted B/T-cell epitopes. The snowflake indicates the number of predicted amino acids for B-cell epitopes, the circle indicates the number of predicted T-cell epitopes presented by HLA I alleles, and the triangle indicates the number of predicted T-cell epitopes presented by HLA II alleles. (D) The example proteins that have abnormal protein length and the number of B/T-cell epitopes relationship.

Characterization of Predicted B/T-Cell Epitopes

Although some genes overlap, we predicted the potential B/T-cell epitopes of all genes because overlapping genes encode different proteins. The results show that the number of predicted epitopes differs among the homologous proteins of the four coronaviruses but remains similar (Figure 1B and Supplementary Table 3). Taking the S protein as an example, an average of 444 peptides are predicted as epitopes presented by HLA I alleles among the four coronaviruses; the most is 482 for the S protein in MERS-CoV, and the least is 423 for that in RaTG13. An average of 1,615 peptides are predicted as epitopes presented by HLA II alleles; the most is 1,804 for S in MERS-CoV, and the least is 1,471 for that in 2019-nCoV. An average of 323 amino acids are predicted as part of B-cell epitopes; the most is 359 for the S protein in 2019-nCoV, and the least is 279 for that in SARS-CoV. The differences in predicted B/T-cell epitopes are therefore minor for S, and a similar pattern occurs in the other homologous genes.

Normally, the number of predicted B/T-cell epitopes is positively correlated with the length of the encoded protein (Figure 1C). However, there are exceptions in which a longer protein has fewer predicted B/T-cell epitopes, such as the N protein compared with the M protein in 2019-nCoV (Figure 1D). With nearly half the length of the N protein, the M protein possesses more T-cell epitopes presented by both HLA I and HLA II alleles, which indicates that the M protein is more likely than the N protein to be recognized by T cells. Besides, all proteins have predicted epitopes presented by HLA II alleles except ORF8a in SARS-CoV, which might be ascribed to its short length and lower immunogenicity.

For better visualization of the predicted B/T-cell epitopes, we created a database named COVIEdb. With four main pages, “B-epitope”, “T-epitope”, “Peptide”, and “Validated”, researchers can find useful information easily and quickly. The predicted B-cell epitopes can be searched on the “B-epitope” page: once the virus and gene are selected, the corresponding predicted B-cell epitopes appear. The predicted T-cell epitopes can be searched on the “T-epitope” page. As on the “B-epitope” page, the coronavirus and protein must be selected; in addition, the type of T-cell epitope must be chosen. Only the peptide-HLA pairs that satisfy the thresholds of all tools are shown on this page. The searchable data on the “Peptide” page combine the previously predicted B-cell and T-cell epitopes; on this page, the only selectable parameter is the protein. The “Validated” page contains the predicted B/T epitopes that have been validated in the recent literature (Le Bert et al., 2020; Zhang B. Z. et al., 2020). To date, there are only 116 validated epitopes on the “Validated” page. However, with the growing research on coronaviruses, more validated data will be added to the “Validated” page.

Shared B/T-Cell Epitopes

Although human coronaviruses evolve rapidly, we tried to identify B/T-cell epitopes that are conserved and shared among different coronaviruses for pan-coronavirus vaccine development. Based on the predicted B-cell and T-cell epitopes, we found 77 peptides that exist in all four coronaviruses and have the potential to induce T-cell activation, 10 of which have a B_score larger than 4 (Table 1 and Supplementary Table 4). In particular, the peptide YFKYWDQTY from ORF1ab could be presented in 7.33% of people, which might make it a good candidate for vaccine design.
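The idea of peptides shared by every virus can be sketched as a set intersection over fixed-length windows. This is only an illustration of the concept, not the prediction pipeline used in the study: the function names and the three toy sequences are hypothetical, and only the peptide YFKYWDQTY is taken from the text above.

```python
def kmers(sequence, k):
    """Return the set of all length-k windows in a protein sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_peptides(proteomes, k=9):
    """Peptides of length k that occur in every sequence of the mapping."""
    peptide_sets = [kmers(seq, k) for seq in proteomes.values()]
    if not peptide_sets:
        return set()
    return set.intersection(*peptide_sets)


# Toy sequences (hypothetical): the same 9-mer with different flanks.
toy_proteomes = {
    "virus_a": "MAAGYFKYWDQTYHPLK",
    "virus_b": "TTSRYFKYWDQTYNCQE",
    "virus_c": "GYFKYWDQTYVVD",
}

print(shared_peptides(toy_proteomes))  # {'YFKYWDQTY'}
```

In a real analysis the shared peptides would still need to pass the epitope prediction thresholds; exact sequence sharing is only the first filter.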

Table 1 The potential T-cell epitopes with B_score larger than 4.

All the T-cell epitopes shared by the four coronaviruses are located in ORF1ab. However, the S protein is the most important coronavirus protein, as it is where the receptor binding domain (RBD) is located. We therefore further investigated the shared epitopes located in the S protein. There are 265 potential epitopes in the S protein shared by three coronaviruses, 35 of which have a B_score larger than 5 (Supplementary Table 5). The peptides VYDPLQPEL and TVYDPLQPEL even have a B_score larger than 6. Notably, although these two peptides differ by only one amino acid, the HLA alleles that can bind them are different. VYDPLQPEL can be presented by HLA-C07:02, HLA-C04:01, and HLA-C14:02, with an overall frequency of 8.26% in the Chinese Han population, while TVYDPLQPEL can be presented by HLA-A02:06 and HLA-C12:03, with a frequency of 2.44%. The two peptides are distinct as epitopes, but we could treat them as one when choosing a vaccine target, which indicates the feasibility of these peptides as potential pan-coronavirus vaccine targets.

We believe that these results and the developed database could not only benefit vaccine development (especially multiple-epitope vaccines, which could protect against various coronaviruses) but also provide targets for drug design, such as neutralizing antibodies against 2019-nCoV and possible future coronavirus outbreaks.


In this work, we employed our previous data mining methodology [22] to identify potential functional motifs, this time applied to the MERS-CoV and SARS-CoV/CoV-2 viruses. The main advantage of this method is that the search is restricted to human protein targets involved in virus pathogenesis. This initial step allows us to reduce a priori the query on the 3DID and ELM databases. As a result, the retrieved domain-motif information is potentially associated with human genes related to the pathogenesis of MERS-CoV and SARS-CoV/CoV-2. Our approach is thus similar to the methods used by Hagai, T., et al., Becerra, A. et al., and Zhang, A. et al. [29, 39, 40] in predicting functional motifs. These methods include some distinctive features, such as predicting disordered regions in the protein, the high frequency of amino acid motifs in the protein sequence datasets under study, and the scarcity of amino acid motifs in shuffled sequences. The filters were tailored according to the information obtained in each data mining process. All of these filtering steps guided our analysis toward greater specificity, linking the predicted functional motifs to immune epitopes, as we did previously for influenza A viruses [22]. This is a distinctive feature of our prediction approach, because it was used to reduce the high rate of false positives associated with the computational prediction of motifs [41]. Furthermore, our method could be an alternative for computer-aided reverse vaccinology.
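One of the filters mentioned above, a motif being frequent in the real sequences but scarce in shuffled ones, can be sketched as follows. This is an illustrative assumption of how such a control might look, not the authors' implementation: the regular expression, the toy sequences, the function names, and the number of shuffles are all hypothetical.

```python
import random
import re


def count_motif(pattern, sequences):
    """Number of sequences that contain at least one match of the motif."""
    regex = re.compile(pattern)
    return sum(1 for seq in sequences if regex.search(seq))


def shuffled(sequence, rng):
    """Return the sequence with its residues randomly permuted."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def shuffle_control(pattern, sequences, n_shuffles=200, seed=0):
    """Mean number of matching sequences after shuffling each sequence."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    total = 0
    for _ in range(n_shuffles):
        total += count_motif(pattern, [shuffled(s, rng) for s in sequences])
    return total / n_shuffles


# Toy data (hypothetical): an ELM-style regular expression and three sequences.
motif = "PP.Y"
toy_sequences = ["MSTPPAYLK", "GGPPRYTTA", "LLKPPEYQS"]

observed = count_motif(motif, toy_sequences)
expected = shuffle_control(motif, toy_sequences)
print(observed, expected)
```

A motif whose observed count is well above the shuffled mean is less likely to be a chance match, which is the intuition behind using shuffled sequences to reduce false positives.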

One interesting result is that matched motifs tended to occur in the most variable proteins of the coronavirus proteomes, ORF1ab and the S protein. ORF1ab contains the nonstructural proteins responsible for the translation machinery of viruses in the intracellular environment [42], and the S protein is essential for the virus’s attachment to the host cell [43]. The tendency of motifs to appear in the proteins involved in virus replication was also observed in influenza viruses [44]. Thus, the high frequency of host-like motifs in those viral proteins suggests that such proteins could be the master kidnappers. Another finding is the high number of motifs shared across the proteome or across distinct proteins of a proteome, which suggests that viral motifs evolve independently to acquire host-like mechanisms for the successful invasion of host cells.

The domain enrichment analysis showed that the general biological processes and molecular functions could be a consequence of MERS-CoV and SARS-CoV/CoV-2 mimicry used to hijack the host cell. The most significant ontology terms are associated with energy-saving and glycogen biosynthesis metabolism. This result agrees with the observation that viruses use the infected cells’ carbon sources to achieve viral replication and virion production [45]. It is reasonable that glycogen, a storage form of glucose, is utilized during unexpected, exhausting cell activity [46] such as infection. On the other hand, as this biosynthetic pathway is vital for the viruses’ survival, targeting essential components such as glycogen synthase kinase could help treat virus infections. It was reported that the use of two glycogen synthase inhibitors altered hepatitis C virus assembly and release [47]. Hence, the proteins we found in the present study could be explored as drug targets.

In another context, motifs have been suggested as potential immunogens [41]. This prompted us to search for motifs that match immune epitopes. Indeed, we found that some motifs matched epitopes in the IEDB. Some of them were nested in the epitopes of the earlier SARS-CoV and are also present in those of the new SARS-CoV-2. This reaffirms the evidence of cross-reactive immune responses to coronavirus infections by SARS-CoV and SARS-CoV-2 [48–51]. Additionally, our study identified epitopes harboring motifs that could interact with human protein domains. This is quite relevant because such domain-motifs shared among the different coronaviruses can trigger a common molecular mimicry process that could lead to autoimmune diseases. It was demonstrated that antibodies derived from flu-vaccinated patients react with homologous sequences of the nucleoprotein of influenza A virus and the hypocretin receptor 2 domain of humans, the latter of which is involved in narcolepsy, an autoimmune adverse effect attributed to the flu vaccine [52]. Influenza immunization has also been linked to Guillain-Barré syndrome [53], a disease whose pathogenesis is associated with the molecular mimicry of several bacterial and viral pathogens [54–56]. Thus, our results are important for the rational vaccine development efforts currently underway, mainly because several autoimmune diseases have been associated with COVID-19 [57].

Uncanny similarity of unique inserts in the 2019-nCoV spike protein to HIV-1 gp120 and Gag

We are currently witnessing a major epidemic caused by the 2019 novel coronavirus (2019-nCoV). The evolution of 2019-nCoV remains elusive. We found 4 insertions in the spike glycoprotein (S) which are unique to the 2019-nCoV and are not present in other coronaviruses. Importantly, amino acid residues in all the 4 inserts have identity or similarity to those in the HIV-1 gp120 or HIV-1 Gag. Interestingly, despite the inserts being discontinuous on the primary amino acid sequence, 3D-modelling of the 2019-nCoV suggests that they converge to constitute the receptor binding site. The finding of 4 unique inserts in the 2019-nCoV, all of which have identity /similarity to amino acid residues in key structural proteins of HIV-1 is unlikely to be fortuitous in nature. This work provides yet unknown insights on 2019-nCoV and sheds light on the evolution and pathogenicity of this virus with important implications for diagnosis of this virus.


AsianScientist (Feb. 25, 2020) – Scientists in China have sequenced the genome of the COVID-19 virus demonstrating that it is a completely new virus, albeit closely related to the coronavirus (CoV) responsible for severe acute respiratory syndrome (SARS). Their findings are published in the journal Chinese Medical Journal.

In early December 2019, people in the city of Wuhan in the Hubei province of China began falling sick after going to a local seafood market. They experienced symptoms like cough, fever, shortness of breath and complications related to acute respiratory distress syndrome. The immediate diagnosis was pneumonia, but the exact cause was unexplained.

In the present study, researchers led by Dr. Wang Jianwei at the Chinese Academy of Medical Sciences, Institute of Pathogen Biology, China, used next generation sequencing (NGS) to definitively identify the pathogen causing illness in Wuhan. They focused on five patients admitted to Jin Yin-tan Hospital in Wuhan, most of whom were workers in the Huanan Seafood Market in Wuhan.

The scientists first obtained bronchoalveolar lavage (BAL) fluid samples taken from the patients, isolated the DNA and RNA, then sequenced the genetic material. Most of the viral sequences belonged to the CoV family of viruses, which includes the SARS-CoV and the Middle East respiratory syndrome-related (MERS) CoV.

The researchers then constructed the whole genomic sequence of the new virus—now known as COVID-19—and found that its genome sequence is 79 percent similar to the SARS-CoV, about 51.8 percent similar to the MERS-CoV, and about 87.6-87.7 percent similar to other SARS-like CoVs from Chinese horseshoe bats (called ZC45 and ZXC21). These findings clearly suggest that the virus originated from bats.

This study paves the way for future studies to understand the virus and its sources better, said the researchers. Although four of the five patients from whom this virus was identified were associated with a seafood market in Wuhan, the exact origin of infection is unknown. The CoV could have been transmitted to humans through an intermediate carrier, such as in the case of SARS-CoV (palm civet meat) or MERS-CoV (camel).

“All human CoVs are zoonotic, and several human CoVs have originated from bats, including the SARS- and MERS-CoVs. Our study clearly shows the urgent need for regular monitoring of the transmission of bat-origin CoVs to humans,” Wang said.

“The emergence of this virus is a massive threat to public health, and therefore, it is of critical importance to understand the source of this virus and decide the next steps before we witness a larger scale outbreak,” he added.

Source: Chinese Academy of Medical Sciences, Institute of Pathogen Biology Photo: Shutterstock.

Functions of the S protein

The S protein on the surface of the virus is a key factor involved in infection. It is a trimeric class I TM glycoprotein responsible for viral entry, and it is present in all kinds of HCoVs, as well as in other viruses such as HIV (HIV glycoprotein 160, Env), influenza virus (influenza hemagglutinin, HA), paramyxovirus (paramyxovirus F), and Ebola (Ebola virus glycoprotein) [30]. Similar to other coronaviruses, the S protein of SARS-CoV-2 mediates receptor recognition, cell attachment, and fusion during viral infection [16, 20, 21, 31,32,33].

The trimer of the S protein located on the surface of the viral envelope is the basic unit by which the S protein binds to the receptor [16, 33]. The S1 domain contains the RBD, which is mainly responsible for binding of the virus to the receptor, while the S2 domain mainly contains the HR domain, including HR1 and HR2, which is closely related to virus fusion [34].

Receptor binding

As mentioned above, the SARS-CoV-2 S protein binds to the host cell by recognizing the receptor ACE2 [33]. ACE2 is a homolog of ACE, which converts angiotensin I to angiotensin 1–9 [35]. ACE2 is distributed mainly in the lung, intestine, heart, and kidney, and alveolar epithelial type II cells are the major expressing cells [36]. ACE2 is also a known receptor for SARS-CoV. The S1 subunit of the SARS-CoV S protein binds with ACE2 to promote the formation of endosomes, which triggers viral fusion activity under low pH (Fig. 1a, b) [37].

Interaction between the S protein and ACE2 can be used to identify intermediate hosts of SARS-CoV-2, as ACE2 from different species, such as amphibians, birds, and mammals, has a conserved primary structure [38]. Luan et al. compared the binding affinities between ACE2 and SARS-CoV-2 S from mammals, birds, snakes, and turtles and found that the ACE2 of Bovidae and Cricetidae interacted well with SARS-CoV-2 S RBD but that ACE2 from snakes and turtles could not.

The S protein binds to ACE2 through the RBD region of the S1 subunit, mediating viral attachment to host cells in the form of a trimer [15]. SARS-CoV-2 S binds to human ACE2 with a dissociation constant (KD) of 14.7 nM, whereas that of SARS-CoV S is 325.8 nM [15], indicating that SARS-CoV-2 S is more sensitive to ACE2 than is SARS-CoV S. Through the identification of SARS-CoV-2 proteins, researchers found 24% difference in S between SARS-CoV-2 and SARS-CoV, whereas that of RBD is
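As a quick worked check of the two dissociation constants quoted in this section, the fold difference in affinity can be computed directly. The variable names are illustrative; the only inputs are the two KD values from the text, and a lower KD corresponds to tighter binding.

```python
# Dissociation constants quoted in the text, in nanomolar.
kd_sars_cov_2_nm = 14.7
kd_sars_cov_nm = 325.8

# Ratio of the two constants: how many times lower the SARS-CoV-2 S KD is.
fold_difference = kd_sars_cov_nm / kd_sars_cov_2_nm
print(f"{fold_difference:.1f}-fold")  # 22.2-fold
```

By these figures, SARS-CoV-2 S binds human ACE2 roughly 22 times more tightly than SARS-CoV S.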

Viral fusion

Viral fusion refers to fusion of the viral membrane and host cell membrane, resulting in the release of the viral genome into the host cell. Cleavage of the SARS-CoV-2 S1 and S2 subunits is the basis of fusion. The S protein is cleaved into two parts, the S1 subunit and S2 subunit, by host proteases, and the subunits exist in a noncovalent form until viral fusion occurs [40]. Researchers have found that the specific furin cleavage site is located in the cleavage site of SARS-CoV-2 but not in other SARS-like CoVs [41, 42]. Mutation of the cleavage site in SARS-CoV-2 or SARS-like CoVs has revealed that the S protein of SARS-CoV-2 exists in an uncleaved state but that the others are mainly in a cleaved state. SARS-CoV-2 S has multiple furin cleavage sites, which increases the probability of being cleaved by furin-like proteases and thereby enhances its infectivity [43, 44]. The furin-like cleavage domain is also present in highly pathogenic influenza virus and is related to its pathogenicity, as observed in the avian influenza outbreak in Hong Kong in 1997 [45, 46]. In addition, host cell proteases such as TMPRSS2 are essential for S protein priming, and they have been shown to be activated in the entry of SARS-CoV and influenza A virus [18, 47, 48]. Another host cell protease that has been proven to cleave viral S protein is trypsin [49]. In summary, the S protein of SARS-CoV-2 is similar to that of SARS-CoV, and host cell proteases are essential for promoting S protein cleavage of both SARS-CoV-2 and SARS-CoV. The presence of a specific furin cleavage site on SARS-CoV-2 S might be one reason that SARS-CoV-2 is more contagious than SARS-CoV.

The formation of 6-HB is essential for viral fusion. The FP in the N-terminus of SARS-CoV-2 and the two HR domains on S2 are essential for viral fusion [50]. After cleavage of the S protein, the FP of SARS-CoV-2 is exposed and triggers viral fusion. Under the action of some special ligands, the fusion protein undergoes a conformational change and then inserts into the host cell membrane (Fig. 1c) [51]. For example, the ligand for influenza A virus is H+, while the ligand for HIV is a coreceptor such as CCR5 or CXCR4 [14]. The distance between the viral membrane and host cell membrane is shortened, and the HR1 domain of the S protein is in close proximity to the host cell membrane, whereas the HR2 domain is closer to the viral membrane side. Then, HR2 folds back onto HR1, the two HR domains form a six-helix structure in an antiparallel format of the fusion core, the viral membrane is pulled toward the host cell membrane and tightly binds to it, and the two membranes fuse [52].


Data and software result public access

An open access, persistent repository of this annotated pig gene data set is at with DOI 10.5967/K8DZ06G3. Transcriptome Shotgun Assembly accession is DQIR01000000 at DDBJ/EMBL/GenBank, BioProject PRJNA480168, for these annotated transcript sequences. Preliminary gene set is at EvidentialGene software package is available at and at

The results of gene assembly for each of the 4 data sources are summarized as follows: pig1a, 11,691,549 assemblies reduced to 595,497 non-redundant coding sequences (5%); pig2b, 3,984,284 assemblies reduced to 404,908 (10%); pig3c, 8,251,720 assemblies reduced to 564,523 (7%); and pig4e, a smaller embryo-only RNA set, 1,955,018 assemblies reduced to 134,156 (7%). These 4 reduced assemblies were then used in secondary runs of SRA2Genes, starting with these as input transcripts. Secondary runs were performed as noted in Methods, with reference homology assessment, to ensure that all valid homologs are captured. Some fragment gene models were successfully improved by additional assembly with rnaSPAdes (16,168 or 5% of final transcripts, including 1,571 loci with best homology). Supplemental Archive 4 contains the scripts generated by SRA2Genes and used to assemble, reduce, annotate and check sample pig1a on a cluster compute system; these are also available in the above noted repository.
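The retained percentages quoted above can be reproduced from the assembly counts with a few lines of arithmetic. The dictionary layout and names are illustrative; the numbers are the ones given in the paragraph.

```python
# (total assemblies, non-redundant coding sequences retained) per data source.
assembly_counts = {
    "pig1a": (11_691_549, 595_497),
    "pig2b": (3_984_284, 404_908),
    "pig3c": (8_251_720, 564_523),
    "pig4e": (1_955_018, 134_156),
}

retained_percent = {
    name: 100.0 * kept / total
    for name, (total, kept) in assembly_counts.items()
}

for name, pct in retained_percent.items():
    print(f"{name}: {pct:.0f}% retained")
# pig1a: 5% retained
# pig2b: 10% retained
# pig3c: 7% retained
# pig4e: 7% retained
```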

The final gene set is summarized in Table 1 by categories of gene quality and evidence. Only coding-sequence genes are reported here. The retained loci include all those with measurable homology to four related vertebrate species gene sets, plus a set of non-homologs that are expressed and have introns in their gene structure, two forms of gene evidence that provide a reliable criterion. The number with homology is similar to that of RefSeq genes for pig. The expressed, multi-exon genes add 15,000 loci, which may be biologically informative in further studies. The pig RefSeq gene set has 63,586 coding-sequence transcripts at 20,610 loci, of which 5,177 CDS at 5,056 loci have exceptions to chromosome location (indels, gaps, and RNA/DNA mismatch). Non-coding genes are not reported in this Evigene pig set, as they lack strong sequence homology across species and are more difficult to validate.