Papers In Press, published online ahead of print February 1, 2007 J. Lipid Res., doi:10.1194/jlr.M600372-JLR200
Journal of Lipid Research, Vol. 48, 434-443, February 2007
High-density genotyping and functional SNP localization in the CETP gene
| ABSTRACT |
|---|
Supplementary key words: genetics; high density lipoprotein; single nucleotide polymorphism; cholesteryl ester transfer protein
| INTRODUCTION |
|---|
In addition to the numerous amino acid variants that have been detected in CETP (5), there is also evidence that promoter SNPs are even more significantly associated with HDL-C than those that change a single amino acid. Common promoter polymorphisms at positions -629 and -971 and a variable repeat sequence have all been reproducibly associated with CETP and/or HDL-C levels (6-10). These associations at the 5' end of the gene are stronger than those observed with SNPs causing amino acid changes at the 3' end of the gene. The functionality of the -629 SNP has been linked to changes in an Sp1/Sp3 binding site (7, 11), whereas the SNP at -971 was shown not to affect transcription (8). No results have been reported on the functional role of the variable repeat sequence. Large segments of the promoter have been fused to reporter genes to determine functional regions, but only changes at positions -629 and -38 have been examined as a function of naturally occurring polymorphisms (7, 11).
Linkage disequilibrium (LD) within CETP makes the identification of functional SNPs difficult. When only low-density genotypic information was available, genome structure was generally approximated by linear collections of haploblocks. Within the CETP gene, initial studies showed two haploblocks, one covering the promoter region and the 5' half of the gene and the second covering the 3' half of the gene and including many nonsynonymous SNPs (6, 10, 12). As higher density SNP information has become available, it has become apparent that a linear collection of haploblocks is a poor approximation of genome structure. An improved description consists of multiple LD bins that are interspersed along the linear DNA sequence, with each bin containing a distinct subset of SNPs (13). These bins are described empirically and, for any given region of the genome, the extent and number of LD bins vary across different ethnicities. Typically, there are cutoffs for both minor allele frequency (MAF; >5%) and extent of LD (R2 > 0.8) to define SNPs within each bin (13).
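The two cutoffs that define an LD bin can be illustrated with a minimal sketch. It assumes unphased genotypes coded as 0, 1, or 2 copies of one allele, uses the squared Pearson correlation of those codes as a proxy for R2, and groups SNPs with a simple greedy rule. The function names, the greedy rule, and the toy genotypes are illustrative assumptions and may differ from the procedure used in (13).

```python
from itertools import combinations

def maf(genotypes):
    """Minor allele frequency from genotypes coded as 0, 1, or 2 allele copies."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)

def r_squared(g1, g2):
    """Squared Pearson correlation of two genotype vectors (proxy for LD R2)."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    return 0.0 if v1 == 0 or v2 == 0 else cov * cov / (v1 * v2)

def ld_bins(snps, maf_min=0.05, r2_min=0.8):
    """Greedy binning: pick the SNP with the most partners above r2_min,
    remove it together with those partners, and repeat."""
    common = {name: g for name, g in snps.items() if maf(g) > maf_min}
    partners = {name: set() for name in common}
    for a, b in combinations(common, 2):
        if r_squared(common[a], common[b]) > r2_min:
            partners[a].add(b)
            partners[b].add(a)
    bins, remaining = [], set(common)
    while remaining:
        tag = max(sorted(remaining), key=lambda s: len(partners[s] & remaining))
        members = {tag} | (partners[tag] & remaining)
        bins.append(sorted(members))
        remaining -= members
    return bins

# Toy genotypes for hypothetical SNPs (not data from the study).
toy = {
    "snpA": [0, 1, 2, 0, 1, 2, 0, 1],
    "snpB": [0, 1, 2, 0, 1, 2, 0, 1],  # identical to snpA
    "snpC": [2, 0, 1, 1, 0, 2, 1, 0],  # weakly correlated with snpA
    "snpD": [0, 0, 0, 0, 0, 0, 0, 0],  # monomorphic, fails the MAF cutoff
}
bins = ld_bins(toy)
```

In this toy set, snpA and snpB share a bin, snpC is a singleton, and snpD is excluded by the MAF cutoff.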
We have previously published association studies with 20 SNPs in and near the CETP gene with 2,500 individuals (5, 10, 11, 14, 15). Although this has provided significant insight into the functional nature of CETP SNPs, there are still many unanswered questions about the role of different SNPs, how they are linked to each other, and the detailed genomic structure surrounding CETP. To help answer some of these questions, we have genotyped 63 additional SNPs in >2,000 individuals and integrated those data with other sources of information to generate a highly detailed map of CETP and its association with HDL-C.
| MATERIALS AND METHODS |
|---|
Genomic DNA was extracted from whole blood using the PureGene DNA isolation system (Gentra) according to the manufacturer's protocol. Some SNPs discussed here were first reported elsewhere (5, 10, 11, 14, 15) and genotyped as described in those publications. SNPs reported for the first time here were genotyped using either TaqMan or SNPlex technology according to the manufacturer's instructions (Applied Biosystems, Foster City, CA).
Statistical analysis
The goal of the statistical analysis was to test for significant genetic associations between HDL-C levels and CETP SNPs across European (n = 3,129) and African (n = 420) subjects. A small number of Asian subjects (n = 36) were also used for comparison of fitted genotype effects, although this population was considered too small for hypothesis testing. The largest cohort studied was from ACCESS and included individuals with European (2,465), African-American (170), and Asian (36) ancestry, after removing subjects who were outliers (beyond 5 sigma) and/or who had missing data for critical phenotypes. Demographic information for the genotyped individuals is listed in Table 1.
The model used had log(HDL-C) as the response and genotype (coded as a three-level factor) as the main effect to be tested for. Explanatory covariates, all of which were significant against log(HDL-C), were as follows: age, gender, ethnicity, cohort, and alcohol consumption (coded as a three-level factor on weekly alcohol consumption: no drinks, 1 to <10 drinks, and 10 or more drinks, based on the distribution of values).
A generalized least-squares model was used, allowing for the heterogeneous variance components described above. Genotype significance for individual ethnicities was evaluated by likelihood comparisons of the full model with one with the genotypes of the targeted ethnicity assigned to a single factor, using a Chi-square test. The final model used for overall significance was
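$$\log(\text{HDL-C}) \sim \text{age} + \text{gender} + \text{ethnicity} + \text{cohort} + \text{alcohol} + \text{genotype} \tag{1}$$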
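The likelihood comparison described above can be sketched in a much reduced form. This sketch assumes an ordinary Gaussian model for log(HDL-C) with a single common variance and no covariates, which is simpler than the generalized least-squares model with heterogeneous variances used in the study. It compares a model with one mean per genotype against an intercept-only model. The function names and HDL-C values are hypothetical.

```python
import math

def rss_about_mean(values):
    """Residual sum of squares around the mean of the values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def genotype_lrt(hdl_by_genotype):
    """Likelihood-ratio test of a three-level genotype factor on log(HDL-C).

    Assumes a common Gaussian variance and no covariates (a simplification).
    """
    if len(hdl_by_genotype) != 3:
        raise ValueError("expects exactly three genotype levels")
    logs = {g: [math.log(v) for v in vals] for g, vals in hdl_by_genotype.items()}
    pooled = [v for vals in logs.values() for v in vals]
    n = len(pooled)
    rss_null = rss_about_mean(pooled)                          # intercept only
    rss_full = sum(rss_about_mean(v) for v in logs.values())   # one mean per genotype
    lr = n * math.log(rss_null / rss_full)                     # -2 log likelihood ratio
    p_value = math.exp(-lr / 2)                                # chi-square tail, 2 df
    return lr, p_value

# Hypothetical HDL-C values in mg/dl, keyed by genotype class.
toy_hdl = {
    "A": [44.0, 46.5, 47.0, 45.2],
    "B": [47.5, 48.0, 49.1, 46.8],
    "C": [49.0, 50.5, 51.2, 48.7],
}
lr_stat, lr_p = genotype_lrt(toy_hdl)
```

With three genotype levels the test has two degrees of freedom, for which the chi-square tail probability has the closed form exp(-x/2); this is why no statistics library is needed here.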
The hypothesis tests were validated with a permutation test. Specifically, because many of the SNPs tested were in high LD with each other (and hence were far from independent), multiple-testing adjustment was performed by comparing the rank-ordered hypothesis test results against 5,000 permutations in which sample IDs were permuted within each ethnicity and then remerged to the full set of genotype values, thus preserving both the underlying LD structure and the explanatory covariates as they related to HDL-C while still simulating the null hypothesis for genotype effects. The adjusted P value for each SNP is the maximum (worst) of the false discovery rate so calculated and the point estimate of its P value from the individual permutation test.
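The within-ethnicity permutation step can be sketched as follows. The idea is that sample IDs are shuffled only among subjects of the same ethnicity, and each subject then receives the complete genotype record of the ID assigned to it, so LD between SNPs and the link between covariates and HDL-C are both preserved. The helper names, the toy data, and the simple mean-difference statistic are assumptions for illustration; the rank-ordered comparison and false discovery rate calculation described above are not reproduced.

```python
import random
from collections import defaultdict

def permute_within_strata(sample_ids, strata, rng):
    """Return a mapping id -> id that shuffles IDs only within each stratum."""
    groups = defaultdict(list)
    for sid in sample_ids:
        groups[strata[sid]].append(sid)
    mapping = {}
    for members in groups.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        mapping.update(zip(members, shuffled))
    return mapping

def permutation_p_value(observed, stat_fn, sample_ids, strata, n_perm=5000, seed=0):
    """Fraction of permutations whose statistic is at least the observed one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        mapping = permute_within_strata(sample_ids, strata, rng)
        if stat_fn(mapping) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical subjects, phenotypes (log HDL-C), and genotypes at one SNP.
ids = ["e1", "e2", "e3", "e4", "a1", "a2", "a3", "a4"]
strata = {i: ("EUR" if i.startswith("e") else "AFR") for i in ids}
phenotype = {"e1": 3.80, "e2": 3.95, "e3": 3.78, "e4": 4.01,
             "a1": 3.90, "a2": 3.88, "a3": 4.02, "a4": 3.99}
genotype = {"e1": 0, "e2": 1, "e3": 0, "e4": 2,
            "a1": 0, "a2": 0, "a3": 1, "a4": 1}

def mean_diff(mapping):
    """Absolute difference in mean phenotype, carriers vs. non-carriers,
    after each subject takes the genotype of the ID mapped to it."""
    carriers = [phenotype[i] for i in ids if genotype[mapping[i]] > 0]
    others = [phenotype[i] for i in ids if genotype[mapping[i]] == 0]
    return abs(sum(carriers) / len(carriers) - sum(others) / len(others))

identity = {i: i for i in ids}
observed = mean_diff(identity)
perm_p = permutation_p_value(observed, mean_diff, ids, strata, n_perm=999, seed=1)
```

Because the mapping is applied to whole genotype records, the same mapping would be reused for every SNP within one permutation.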
In generating effects plots, the Asian population was merged with the African and European populations, then the entire data set was fit to a null model (i.e., one without a genotype term), and the residuals were plotted by ethnicity and genotype using box and whisker plots. Genotypes are denoted by A, B, or C, with A denoting the wild-type homozygotes (as defined by empirical observation of the pooled population), B denoting the heterozygotes, and C denoting the homozygotes in the minor allele, still based on empirical observation over all subjects. These definitions remained fixed across ethnicities, even if the MAF crossed the 50% barrier going from one ethnicity to another.
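The A/B/C labeling can be sketched for a single biallelic SNP. In this sketch the more frequent allele in the pooled sample is treated as the reference, and each subject is labeled by the number of minor alleles carried, so the labels stay fixed regardless of ethnicity. The function name and the example genotypes are hypothetical.

```python
from collections import Counter

def genotype_labels(genotypes):
    """Label diploid genotypes (e.g. 'CT') as 'A', 'B', or 'C' using the
    pooled allele counts: A = major homozygote, B = heterozygote,
    C = minor homozygote."""
    allele_counts = Counter(allele for g in genotypes for allele in g)
    if len(allele_counts) != 2:
        raise ValueError("expects a biallelic SNP")
    (major, _), (minor, _) = allele_counts.most_common(2)
    labels = ["ABC"[g.count(minor)] for g in genotypes]
    return major, minor, labels

# Hypothetical pooled genotypes across all subjects.
pooled = ["CC", "CT", "CC", "TT", "CT", "CC"]
major_allele, minor_allele, labels = genotype_labels(pooled)
```

Here C is the pooled major allele, so "CC" maps to A, "CT" to B, and "TT" to C.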
| RESULTS |
|---|
Only two sets of genotype data published to date span the entire 50 kb region of interest: HapMap (http://www.hapmap.org/cgi-perl/gbrowse/hapmap20_B35/) and the SNPs published by Hinds et al. (13) (http://genome.perlegen.com/browser/index.html). Within this 50 kb region, there are 103 HapMap SNPs with nonzero MAF, including a subset of 87 with a frequency of >5% in at least one ethnic group. Similarly, there are 33 SNPs reported by Hinds et al. (13) (all nonzero MAF) in this region. Thirteen of these 33 SNPs are not genotyped in the HapMap set. Using the individual genotype data from the data sets for which they are available, an initial set of LD bins was determined as described (13). The resulting bins were compared across data sets to the extent possible to determine which bins could be combined. Because all of these data sets have a limited number of individuals and none includes complete sequence information over the entire region, many of the LD bins are poorly defined. To better understand the nature of these bins and to collapse them into a smaller set, SNPs were chosen from across the region for analysis in a large, multi-ethnic population. Genotype data from the original databases and the literature were generated using a variety of techniques. Some SNPs are not amenable to genotyping by particular technologies, and not all could be assayed by the SNPlex technology used here, preventing us from obtaining a complete set of genotypes. To the extent possible, some SNPs unique to each data set were included so that comparison across studies could be accomplished.
All SNPs that we genotyped were in Hardy-Weinberg equilibrium with P > 0.05, with the exception of one SNP each in individuals of European and African ancestry. Both of these SNPs had P > 0.03 and were thus within the expected range of normal variation for a study with this number of SNPs tested. Of the 103 published HapMap SNPs, we generated more in-depth data for 40, including 38 with MAF > 5% in at least one population. Of the 33 Hinds et al. (13) SNPs, we generated data for 29. By generating genotypes for thousands of individuals, including SNPs from each of these data sets, we are able to establish LD bins representing SNPs spanning all of these data sets. A summary of the allele frequency for each ethnicity for polymorphisms with MAF > 5% for which we have generated data (reported previously or here for the first time) is shown in Fig. 1. For completeness, uncommon SNPs and an additional 101 SNPs for which others have published individual genotype data are also included in supplementary Table I. Some SNPs have multiple dbSNP identifiers; we have used the one chosen by the National Center for Biotechnology Information and listed the alternative numbers. Even though the group size for some populations is relatively small (20 to 50 individuals), the minor allele frequencies are consistent across studies within the same ethnic group (see supplementary Table I).
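The Hardy-Weinberg check reported at the start of this section can be illustrated with a minimal sketch. It assumes a one-degree-of-freedom chi-square goodness-of-fit test on the three genotype counts of a biallelic SNP; the text does not state which test was used, so this choice, the function name, and the example counts are assumptions.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic SNP.

    Returns (chi2, p_value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele a
    q = 1 - p
    if p == 0 or q == 0:
        return 0.0, 1.0               # monomorphic: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))   # chi-square tail, 1 df
    return chi2, p_value

# Hypothetical genotype counts.
chi2_ok, p_ok = hwe_chi_square(100, 200, 100)    # exactly at HWE proportions
chi2_bad, p_bad = hwe_chi_square(150, 100, 150)  # strong heterozygote deficit
```

For one degree of freedom the chi-square tail probability equals erfc(sqrt(x/2)), which the standard library provides directly.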
When we define LD bins using our data, the SNPs grouped together are very consistent within an ethnic group compared with those generated by Hinds et al. (13) or using the same definitions with HapMap data, even though both sets had far fewer individuals. The only discrepancies found with either data set are with SNPs that are very close to either the MAF or R2 cutoff in one population or the other. Otherwise, there is perfect agreement for LD bin composition.
Using the cutoffs of MAF > 5% and R2 > 0.8 for LD bin determination and all available data, 102 SNPs can be placed in 49 bins for individuals of European ancestry. The most populated bin contains eight SNPs, and the largest span covered by a single bin is 10,731 bp. For HapMap samples, 40 tagging SNPs representing 59 SNPs are identified. Nearly identical bins are generated with only two HapMap tagging SNPs falling into the same bin as defined here. Adding the 43 SNPs not found in HapMap required only nine additional bins, confirming that there is a point of diminishing returns for genotyping but that genotyping at a frequency of greater than one SNP per kilobase still has the potential to generate useful data.
For individuals of African ancestry, there are 97 SNPs in 66 bins. The most populated bin contains seven SNPs, and the largest spans 6,621 bp. The larger number of bins and shorter extent of DNA sequence covered by each bin in those of African-American versus European ancestry are similar to what has been observed previously (13). Our bins match well with the African Americans characterized by Hinds et al. (13) but poorly with the Nigerian HapMap data. This highlights the difficulties of comparing across populations and ensuring that ancestry is matched appropriately. We have not attempted a comparison of our Asian ancestry bins with others because of the small number of individuals genotyped.
Associations with HDL-C
In addition to characterizing the genomic structure surrounding CETP, we also wanted to determine how the association with HDL-C was superimposed on that structure. Among individuals of European ancestry, the strongest associations are clearly in the promoter but span a very broad region. We have extended coverage to within 3 kb of the neighboring upstream gene, HERPUD1, 15 kb from the CETP transcriptional start. Our finding that a SNP >10 kb away (rs9989419) from the start site is associated with HDL-C suggests that distal interactions may play a role in regulating CETP levels. However, much of this association appears to arise from LD with nearby SNPs. All of the SNPs most highly associated with HDL-C among individuals of European ancestry are in LD bin 8. LD bins 6 and 10 are interspersed in this part of the promoter region but are orders of magnitude less significantly associated with HDL-C based on point estimates (Fig. 1). Several singleton SNPs in the promoter are also associated with HDL-C, but not as strongly as those in bin 8. One of the most highly associated SNPs, rs183130 in bin 8, has a consistent effect across ethnic groups, as shown in Fig. 3. The sequence conservation for this region among primates and the sequence surrounding two other functional SNPs is shown in Fig. 4. In each region, the chimpanzee sequence is identical to the human sequence, whereas other species have up to several changes. For each of the putative transcription factor binding sites, some of these nonconserved positions would be predicted to affect protein binding.
All associations were tested for gender effects as well. Two SNPs in individuals of European ancestry and one SNP in individuals of African-American ancestry yielded gender-genotype interactions with P values between 0.03 and 0.05, not significant after correction for multiple testing. Interactions between SNPs were also tested. Because of the number of SNPs and comparisons involved, this was done only for individuals of European ancestry in the ACCESS trial. Four pairs of SNPs show strong, nonadditive interactions, even after correction for multiple testing. The most significant interaction, between rs12920974 and rs4783961, has an uncorrected P value of 1.7 × 10⁻⁸, remaining significant even after correction for 2,616 tests. Both of these SNPs are in the promoter region. The HDL-C means for each combination of genotypes are shown in Fig. 5. The other SNP pairs that remain significant after multiple testing correction all include rs7203286 (in the distal promoter region) in combination with rs820299 (intron 2), rs158477 (intron 9), or rs4783961 (promoter).
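The arithmetic behind the multiple-testing statement for the top interaction can be checked with a short sketch. It reads the uncorrected P value as 1.7 × 10⁻⁸ and assumes a Bonferroni correction over the 2,616 tests; the text does not name the correction method, so Bonferroni is used here only as the most conservative common choice.

```python
# Values taken from the text; the Bonferroni method itself is an assumption.
p_uncorrected = 1.7e-8
n_tests = 2616
alpha = 0.05

# Bonferroni-adjusted P value (capped at 1) and the equivalent per-test threshold.
p_bonferroni = min(1.0, p_uncorrected * n_tests)
per_test_threshold = alpha / n_tests

still_significant = p_bonferroni < alpha
```

Under this assumption the adjusted P value is about 4.4 × 10⁻⁵, well below 0.05, which agrees with the statement that the interaction remains significant after correction.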
| DISCUSSION |
|---|
All common CETP variants have, at most, modest effects on either CETP mass or activity. If the most significantly associated SNP, rs183130, is examined, the mean HDL-C for the common versus rare homozygote varies only from 46.3 to 49.7 mg/dl among individuals of European ancestry, a difference of <10%. In contrast, both gender and alcohol consumption have larger effects on HDL-C, with European females in the ACCESS trial having significantly higher HDL-C (54 mg/dl) than European males (44 mg/dl). Similarly, European males who consume >10 drinks per week have higher HDL-C (50.7 mg/dl) than those who consume none (41.9 mg/dl). Only rare, nonfunctional CETP variants have a large impact on HDL-C, and these also have a protective effect with respect to disease in large, prospective studies (25). Thus, the impact of common CETP SNPs can be readily observed on a population basis, but these are of little value when examining small numbers of individuals.
It is possible to overinterpret the effect of CETP on HDL-C and attribute any large change in HDL-C to CETP variants of modest functional significance, even when known environmental effects such as exercise are present (26). Even the high HDL-C induced by extreme exercise, such as running marathons (27), is not always protective for cardiovascular disease. Risk of sudden cardiac death is generally attributable to a variety of defects in cardiac structural and channel proteins, with >50% of such deaths attributed to hypertrophic cardiomyopathy (28), independent of CETP genotype.
When data for only a limited number of SNPs were available, it was most convenient to describe the genomic structure of CETP and other genes as a series of large haploblocks. Initial work was consistent in showing that a large number of SNPs in the promoter and 5' region of the gene were in LD with each other, whereas another set of SNPs in the 3' region constituted another haploblock (6, 10, 12). This simplified view of the gene was useful as a rough approximation but is not accurate when comparing SNPs supposedly in the same haploblock but really having little linkage. With much more data now available, it is clear that the more detailed approach of using LD bins or tagging SNPs from nonlinear parts of the genome is necessary for an accurate view. Although there are many approaches and definitions that can be used for defining LD bins and tagging SNPs, similar results are obtained across populations with the same ancestry. The bins defined by Hinds et al. (13) with only 24 individuals were nearly identical to those defined by us with >2,000 individuals.
Horne et al. (21) identified a set of tagging SNPs for CETP that overlap the SNPs examined here. Despite using a very different approach, many SNPs represented by their tagging SNPs fall into separate bins, as defined here. The overlap is not perfect, but the same overall picture of the genomic structure is generated. However, comparison of our data with this set of tagging SNPs also highlights the need to characterize extensive regions of DNA sequence for a given gene. Their most distal SNP examined was only 631 bp from the start of transcription; thus, the LD bins with the most significant associations with HDL-C were not tagged.
The promoter SNP at -629 has been shown to affect Sp1 binding in vitro, reporter activity in cells, and CETP mass levels in humans (7, 10, 11). In contrast, the TaqIB SNP (rs708272) is among the most extensively studied, but there has been no indication that it exerts any functional effect. Although TaqIB is not in the same LD bin as the -629 SNP, it is in reasonably high LD (R2 = 0.72) in individuals of European ancestry. Furthermore, TaqIB is in the same LD bin as other SNPs that span a large region of the gene, including SNPs in intron 2 (R2 = 0.85), intron 5 (R2 = 0.81), and intron 7 (R2 = 0.91) in individuals of both European and Asian ancestry. Any of these SNPs could potentially have some functional effect, or it could arise from some other uncharacterized SNP in this 9-kb region. The TaqIB SNP is also associated with HDL-C in African Americans, but it is a singleton in terms of LD bins. Unlike individuals of European and Asian ancestry, in whom the most strongly linked SNPs are 3' to TaqIB, the SNPs most strongly linked to TaqIB among individuals of African ancestry are in the promoter with rs183130 (R2 = 0.68) and the variable number of tandem repeats (R2 = 0.64) being in tightest LD. This suggests that the functional SNP(s) assessed when TaqIB was examined in Africans is different from the functional SNP(s) assessed in Asians and Europeans.
Even though the promoter -629 SNP has been shown to have functional effects, it is clear that other SNPs in the promoter are also independently associated with HDL-C. The most highly associated SNPs are those in LD bin 8. However, among individuals of European ancestry, there are seven SNPs spread over 6,500 bp in bin 8, six of which we genotyped. Each could be examined on its own for functionality, but that is a challenging and not always fruitful endeavor. The availability of results from multiple ethnic groups makes it possible to decrease the number of potentially functional SNPs by taking advantage of the different LD structures. All six of the bin 8 SNPs we examined in Europeans are highly associated with HDL-C. Among Asians, all five of the SNPs in the homologous bin are also associated with HDL-C. Among Africans, the seven SNPs from bin 8 in Europeans are split into four separate bins, and only one of them is associated with HDL-C, rs183130 at position -4502. Although it is possible that there are distinct functional SNPs in the different ethnic groups, the consistent results (Fig. 3) with this SNP suggest that rs183130 may be functional. Additional experiments will be necessary to confirm this.
When 11 promoter SNPs that are associated with HDL-C among individuals of European ancestry are scanned for transcription factor binding sites (29), the only one that results in a change is rs183130. When a G is present on the bottom strand (GGGATTCTCC), an 8:10 match to the consensus site for nuclear factor κB (GGGGYNNCCY) described by Ghosh, May, and Kopp (30) and a 10:10 match to the consensus site (GGGRDTYYCC) described by Liu et al. (31) are found. The alteration from G to A creates a mismatch in both consensus sequences. The Liu et al. (31) consensus is particularly interesting in that it also appears to bind members of the Sp1 family of proteins that have been shown to be important in regulating CETP at the proximal promoter SNPs at -629 and -38 (7, 11).
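The 8:10 and 10:10 match counts quoted above can be reproduced with a short sketch that scores a site against a consensus written in IUPAC ambiguity codes. The site and consensus strings are taken from the text; the function name is hypothetical, and because the text does not say which G in the site is altered by rs183130, the sketch simply tries a G-to-A change at every G in the 10-mer.

```python
# IUPAC nucleotide codes used by the two consensus sequences.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "D": "AGT", "N": "ACGT",
}

def match_count(site, consensus):
    """Number of positions where the site base is allowed by the consensus."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(base in IUPAC[code] for base, code in zip(site, consensus))

SITE = "GGGATTCTCC"    # bottom-strand sequence with G present
GHOSH = "GGGGYNNCCY"   # consensus from reference 30
LIU = "GGGRDTYYCC"     # consensus from reference 31

ghosh_matches = match_count(SITE, GHOSH)
liu_matches = match_count(SITE, LIU)

# Replace each G in the site with A, one at a time.
variants = [SITE[:i] + "A" + SITE[i + 1:] for i, b in enumerate(SITE) if b == "G"]
ghosh_after = [match_count(v, GHOSH) for v in variants]
liu_after = [match_count(v, LIU) for v in variants]
```

All three G positions in this 10-mer are fixed as G in both consensus sequences, so a G-to-A change at any of them costs one match in each, in line with the statement that the alteration creates a mismatch in both.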
Large meta-analyses of some CETP SNPs have been published, and these show a consistent but variable association with HDL-C, CETP mass/activity, and other phenotypes (2, 3). Results in studies with large numbers of individuals with European ancestry are in agreement with our results (supplementary Table I). Promoter SNPs rs12149545, rs4783961, and rs1800775 are significantly associated with HDL-C (6, 8, 32, 33). Similarly, results with large numbers of individuals of Asian ancestry found rs3764261 and the VNTR highly associated with HDL-C (9), as we found. All of these studies have focused on the promoter sequence within 3,300 bp of the transcriptional start site. Most functional characterization of the promoter has been restricted to the proximal 3,000 bp (34), with only limited analysis beyond that region (35).
In addition to the univariate analyses, nonadditive interactions between SNPs may also be important, as seen with other lipid-related genes (36). Our observation that the association of some promoter SNPs (rs4783961 at -971) is nonadditive with other SNPs mirrors previous findings (33) in which that SNP's function in vitro was dependent on other nearby SNPs. The fragment tested in vitro was only 1,707 bp long, so it did not include the SNP we found most significant, rs12920974 at -2,940, but the concordance of results clearly shows the complexity of genetic modulation of transcription.
The data provided here yield a means of comparing results across many studies by determining which SNPs are likely to yield similar information and which will not. The extensive genotyping also allows other investigators to compare their populations with an independent set to test whether differences in allele frequencies found in a case-control study might be attributable to problems with the control population rather than a true association. As stated above, the SNPs tested here are all in Hardy-Weinberg equilibrium, unlike some control populations described elsewhere. By examining the CETP gene in detail and across populations, we are able to predict which SNPs are likely to be functional. This approach is generalizable to other genes in which robust associations are found.
Manuscript received August 18, 2006 and in revised form November 10, 2006.
| REFERENCES |
|---|
30. Ghosh, S., M. J. May, and E. B. Kopp. 1998. NF-κB and Rel proteins: evolutionarily conserved mediators of immune responses. Annu. Rev. Immunol. 16: 225-260.