High-density genotyping and functional SNP localization in the CETP gene.

The cholesteryl ester transfer protein gene (CETP) has been the subject of hundreds of genetic analyses that typically focus on a small number of polymorphisms within a single ethnic group. Furthermore, the extent of DNA beyond the transcribed sequence from which single nucleotide polymorphisms (SNPs) may influence CETP expression has not been well defined. To better understand the role of natural variation in modulating CETP and high density lipoprotein-cholesterol (HDL-C) levels, dense genotyping of CETP and regions up to 15 kb on either side of the gene was carried out on >2,000 individuals. A complex, nonlinear set of linkage disequilibrium bins was found, with many bins interspersed along the DNA sequence and spread over large regions of the gene. Bins assigned based on large numbers of individuals matched the small subset of SNPs that had been assigned to bins previously with a small number of individuals. Associations of known functional SNPs with HDL-C were found, but there were suggestions that there are additional functional SNPs not characterized previously. Narrowing of the set of likely functional SNPs was accomplished by comparing associations observed in different ethnic groups. The promoter SNP most highly associated with HDL-C that is likely to be functional, position -4,502, alters a consensus transcription factor binding site.

The importance of the cholesteryl ester transfer protein gene (CETP) in affecting high density lipoprotein-cholesterol (HDL-C) levels in humans was originally detected when individuals lacking active protein were identified based on high HDL-C levels (1). Since then, individuals lacking CETP as well as those with varying levels of CETP or with variant sequences have been studied extensively (reviewed in Refs. 2,3). In addition to the complete null mutations, many single nucleotide polymorphisms (SNPs) in CETP have been found to be reproducibly associated with protein mass/activity and/or HDL-C.
When sample sizes are large enough, there is a high degree of consistency across studies and populations. Results with the closely linked phenotypes of CETP mass/activity and HDL-C are highly replicated, but associations with other, more complex phenotypes, such as cardiovascular disease, have been less easily replicated. However, when studies are of sufficient size and properly designed, associations with the more complex phenotypes can often be found. For example, a meta-analysis of the TaqIB SNP showed that the allele associated with low CETP was also associated with high HDL-C and lower levels of coronary artery disease (4). When studies examine only one or a small number of SNPs, integration of results with other studies can be challenging.
In addition to the numerous amino acid variants that have been detected in CETP (5), there is also evidence that promoter SNPs are even more significantly associated with HDL-C than those that change a single amino acid. Common promoter polymorphisms at positions -629 and -971 and a variable repeat sequence have all been reproducibly associated with CETP and/or HDL-C levels (6-10). These associations at the 5′ end of the gene are stronger than those observed with SNPs causing amino acid changes at the 3′ end of the gene. The functionality of the -629 SNP has been linked to changes in an Sp1/Sp3 binding site (7,11), whereas the SNP at -971 was shown not to affect transcription (8). No results have been reported on the functional role of the variable repeat sequence. Large segments of the promoter have been fused to reporter genes to determine functional regions, but only changes at positions -629 and -38 have been examined as a function of naturally occurring polymorphisms (7,11).
Linkage disequilibrium (LD) within CETP makes the identification of functional SNPs difficult. When only low-density genotypic information was available, genome structure was generally approximated by linear collections of haploblocks. Within the CETP gene, initial studies showed two haploblocks, one covering the promoter region and the 5′ half of the gene and the second covering the 3′ half of the gene and including many nonsynonymous SNPs (6,10,12). As higher density SNP information has become available, it has become apparent that a linear collection of haploblocks is a poor approximation of genome structure. An improved description consists of multiple LD bins that are interspersed along the linear DNA sequence, with each bin containing a distinct subset of SNPs (13). These bins are described empirically and, for any given region of the genome, the extent and number of LD bins vary across different ethnicities. Typically, there are cutoffs for both minor allele frequency (MAF; 5%) and extent of LD (R² > 0.8) to define SNPs within each bin (13).
We have previously published association studies with 20 SNPs in and near the CETP gene with ~2,500 individuals (5,10,11,14,15). Although this has provided significant insight into the functional nature of CETP SNPs, there are still many unanswered questions about the role of different SNPs, how they are linked to each other, and the detailed genomic structure surrounding CETP. To help answer some of these questions, we have genotyped 63 additional SNPs in >2,000 individuals and integrated that data with other sources of information to generate a highly detailed map of CETP and its association with HDL-C.

Samples and genotyping
The Atorvastatin Comparative Cholesterol Efficacy and Safety Study (ACCESS) (16) was designed to determine the safety and efficacy profile of atorvastatin compared with other HMG-CoA reductase inhibitors when used to treat patients with National Cholesterol Education Program LDL-C criteria. Whole blood from participating subjects was obtained with appropriate institutional review and appropriate informed consent documentation that defined the study design and provided an assessment of the risks and benefits associated with study participation. A second European cohort came from a previous Pfizer clinical trial in the cardiovascular area that recruited healthy patients (n = 664). DNA from another African-American cohort (n = 250) was purchased from Genomics Collaborative, Inc. (Cambridge, MA). All laboratory tests were performed at a central laboratory (Medical Research Laboratories, Highland Heights, KY) certified by the National Heart, Lung, and Blood Institute/Centers for Disease Control Part III Program. HDL-C was measured in a fasting sample. No subfraction analysis was done.
Genomic DNA was extracted from whole blood using the PureGene DNA isolation system (Gentra) according to the manufacturer's protocol. Some SNPs discussed here were first reported elsewhere (5,10,11,14,15) and genotyped as described in those publications. SNPs reported for the first time here were genotyped using either TaqMan or SNPlex technology according to the manufacturer's instructions (Applied Biosystems, Foster City, CA).

Statistical analysis
The goal of the statistical analysis was to test for significant genetic associations between HDL-C levels and CETP SNPs across European (n = 3,129) and African (n = 420) subjects. A small number of Asian subjects (n = 36) were also used for comparison of fitted genotype effects, although this population was considered too small for hypothesis testing. The largest cohort studied was from ACCESS and included individuals with European (2,465), African-American (170), and Asian (36) ancestry, after removing subjects who were outliers (beyond 5 sigma) and/or who had missing data for critical phenotypes. Demographic information for the genotyped individuals is listed in Table 1.
It was determined via standard inspection of qq-plots that a log transformation was appropriate for the HDL-C response. Unfortunately, the variance in log(HDL-C) varied significantly by cohort as well as by gender, although to a lesser degree. However, within the ACCESS cohort, the African females exhibited much higher variance than either the African males or the Europeans of either gender. Hence, variance was allowed to vary by three factors: cohort, ethnicity, and gender.
The model used had log(HDL-C) as the response and genotype (coded as a three-level factor) as the main effect to be tested for. Explanatory covariates, all of which were significant against log(HDL-C), were as follows: age, gender, ethnicity, cohort, and alcohol consumption (coded as a three-level factor on weekly alcohol consumption: no drinks, 1 to <10 drinks, and 10 or more drinks, based on the distribution of values).
A generalized least-squares model was used, allowing for the heterogeneous variance components described above. Genotype significance for individual ethnicities was evaluated by likelihood comparisons of the full model with one with the genotypes of the targeted ethnicity assigned to a single factor, using a chi-square test. The final model used for overall significance was

log(HDL-C) ~ age + gender + ethnicity + cohort + alcohol + genotype

This model tested the hypothesis for overall genotype effect, and in addition, it was compared against an identical model with an ethnicity × genotype interaction term added, and the significance of this interaction term was evaluated using a chi-square test comparing the deviances of the two models.

Table 1 footnote. HDL-C, high density lipoprotein-cholesterol. Demographic and lipid values are provided for individuals who were genotyped from the three trials described in Materials and Methods. Values for age (years), HDL-C (mg/ml), triglycerides (mg/ml), and body mass index (kg/m²) are all means.
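The likelihood comparison between nested models can be illustrated with a minimal sketch. It assumes the two log-likelihoods have already been obtained from fitted models; the numeric values below are hypothetical, and the function names `chi2_sf_even_df` and `likelihood_ratio_test` are illustrative rather than taken from the analysis software used in this study. Because genotype is a three-level factor, dropping it removes two parameters, so the test has 2 degrees of freedom, for which the chi-square tail probability has a simple closed form.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    which is valid only for positive even degrees of freedom.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(loglik_reduced, loglik_full, df):
    """Return (statistic, P value) for a nested-model comparison."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf_even_df(max(stat, 0.0), df)


# Hypothetical log-likelihoods for a model without and with the genotype factor.
stat, p_value = likelihood_ratio_test(-1000.0, -995.0, df=2)
print(f"LR statistic = {stat:.2f}, P = {p_value:.4g}")
```

The same function applies to the interaction test, with the degrees of freedom set to the number of interaction parameters added (provided that number is even; an odd count would need a general chi-square routine).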
The hypothesis tests were validated with a permutation test. Specifically, because many of the SNPs tested were in high LD with each other (and hence were far from independent), multiple-testing adjustment was performed by comparing the rank-ordered hypothesis test results against 5,000 permutations in which sample IDs were permuted within each ethnicity and then remerged to the full set of genotype values, thus preserving both the underlying LD structure and the explanatory covariates as they related to HDL-C while still simulating the null hypothesis for genotype effects. The adjusted P value for each SNP is the maximum (worst) of the false discovery rate so calculated and the point estimate of its P value from the individual permutation test.
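The core of this permutation scheme, shuffling sample IDs within each ethnicity and applying the same shuffle to every SNP, can be sketched as follows. This is a simplified illustration under stated assumptions: the association statistic is a toy absolute covariance rather than the generalized least-squares fit, the output is a per-SNP empirical P value rather than the rank-ordered false discovery rate, and all function names and input layouts are hypothetical.

```python
import random
from collections import defaultdict


def shuffle_ids_within_groups(sample_ids, groups, rng):
    """Map each sample ID to an ID drawn from the same group (e.g. ethnicity)."""
    members_by_group = defaultdict(list)
    for sample_id, group in zip(sample_ids, groups):
        members_by_group[group].append(sample_id)
    mapping = {}
    for members in members_by_group.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        mapping.update(zip(members, shuffled))
    return mapping


def abs_covariance(x, y):
    """Toy association statistic: |covariance| of genotype dosage and phenotype."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return abs(sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n)


def permutation_p_values(phenotype, groups, genotypes, n_perm=1000, seed=0):
    """Empirical P value per SNP.

    phenotype: dict sample_id -> covariate-adjusted log(HDL-C) (assumed input)
    groups:    dict sample_id -> ethnicity label
    genotypes: dict snp_name -> dict sample_id -> dosage (0, 1, or 2)
    """
    rng = random.Random(seed)
    ids = sorted(phenotype)
    y = [phenotype[i] for i in ids]
    group_list = [groups[i] for i in ids]
    observed = {s: abs_covariance([g[i] for i in ids], y) for s, g in genotypes.items()}
    exceed = {s: 0 for s in genotypes}
    for _ in range(n_perm):
        mapping = shuffle_ids_within_groups(ids, group_list, rng)
        for s, g in genotypes.items():
            # One mapping is shared by all SNPs, so each person's whole genotype
            # row moves together and the LD between SNPs is left intact.
            permuted = [g[mapping[i]] for i in ids]
            if abs_covariance(permuted, y) >= observed[s]:
                exceed[s] += 1
    return {s: (exceed[s] + 1) / (n_perm + 1) for s in genotypes}
```

Adding 1 to both numerator and denominator keeps the empirical P value above zero, which is why a run of 5,000 permutations cannot report a value much below 0.001.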
In generating effects plots, the Asian population was merged with the African and European populations, then the entire data set was fit to a null model (i.e., one without a genotype term), and the residuals were plotted by ethnicity and genotype using box and whisker plots. Genotypes are denoted by A, B, or C, with A denoting the wild-type homozygotes (as defined by empirical observation of the pooled population), B denoting the heterozygotes, and C denoting the homozygotes in the minor allele, still based on empirical observation over all subjects. These definitions remained fixed across ethnicities, even if the MAF crossed the 50% barrier going from one ethnicity to another.
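The A/B/C genotype labelling can be illustrated with a short sketch. It assumes biallelic genotypes stored as two-character strings (a hypothetical input format) and defines the common allele once, from the pooled sample, so that the labels stay fixed when a subgroup has a different allele frequency.

```python
from collections import Counter


def pooled_major_allele(genotypes):
    """Most frequent allele across all individuals (ties broken alphabetically)."""
    counts = Counter(allele for g in genotypes for allele in g)
    if len(counts) > 2:
        raise ValueError("expected a biallelic SNP")
    return max(sorted(counts), key=lambda a: counts[a])


def code_genotypes(genotypes, major):
    """Label each genotype: A = major homozygote, B = heterozygote, C = minor homozygote."""
    label_by_major_count = {2: "A", 1: "B", 0: "C"}
    return [label_by_major_count[sum(1 for a in g if a == major)] for g in genotypes]


pooled = ["GG", "GG", "GA", "AA", "GG", "AG"]
major = pooled_major_allele(pooled)          # "G" in this toy sample
subgroup = ["AA", "AA", "GA"]                # "A" is more common here
print(code_genotypes(subgroup, major))       # still coded against the pooled major allele
```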

Genomic structure and LD blocks
To gain maximal information about the genetic structure surrounding CETP, databases and the literature were searched for all SNPs and other polymorphisms. The region we examined included 15 kb upstream of the CETP gene, the 22-kb gene, and 13 kb downstream of the gene. Segments of this region have been resequenced in 10 to 200 individuals (12,13,15,17-21), with the greatest focus on the promoter and exons. Many laboratories carried out the sequencing in multiple ethnic groups. Within the heavily sequenced regions, all common SNPs have been identified, but some introns and regions outside of the gene have not been as well characterized for variation.
Only two sets of genotype data published to date span the entire 50 kb region of interest: HapMap (http://www.hapmap.org/cgi-perl/gbrowse/hapmap20_B35/) and the SNPs published by Hinds et al. (13). Hinds et al. (13) reported 33 SNPs (all nonzero MAF) in this region. Thirteen of these 33 SNPs are not genotyped in the HapMap set. Using the individual genotype data from the data sets for which they are available, an initial set of LD bins was determined as described (13). The resulting bins were compared across data sets to the extent possible to determine which bins could be combined. Because all of these data sets have a limited number of individuals and none includes complete sequence information over the entire region, many of the LD bins are poorly defined. To better understand the nature of these bins and to collapse them into a smaller set, SNPs were chosen from across the region for analysis in a large, multi-ethnic population. Genotype data from the original databases and the literature were generated using a variety of techniques. Some SNPs are not amenable to genotyping by particular technologies, and not all could be assayed by the SNPlex technology used here, preventing us from obtaining a complete set of genotypes. To the extent possible, some SNPs unique to each data set were included so that comparison across studies could be accomplished.
All SNPs that we genotyped were in Hardy-Weinberg equilibrium with P > 0.05, with the exception of one SNP each in individuals of European and African ancestry. Both of these SNPs had P > 0.03 and were thus within the expected range of normal variation for a study with this number of SNPs tested. Of the 103 published HapMap SNPs, we generated more in-depth data for 40 of them, including 38 with MAF > 5% in at least one population. Of the 33 Hinds et al. (13) SNPs, we generated data for 29 of them. By generating genotypes for thousands of individuals that include SNPs from each of these sets of individuals, we are able to establish LD bins representing SNPs spanning all of these data sets. A summary of the allele frequency for each ethnicity for polymorphisms with MAF > 5% for which we have generated data (reported previously or here for the first time) is shown in Fig. 1. For completeness, uncommon SNPs and an additional 101 SNPs for which others have published individual genotype data are also included in supplementary Table I. Some SNPs have multiple dbSNP identifiers, and we have used the one chosen by National Center for Biotechnology Information and listed alternative numbers. Even though the group size for some populations is relatively small (20 to 50 individuals), the minor allele frequencies are consistent across studies within the same ethnic group (see supplementary Table I).
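The Hardy-Weinberg equilibrium check reported above can be sketched with a one-degree-of-freedom chi-square test on genotype counts. This is an illustration only: the text does not state which test was used (an exact test is a common alternative), and the counts in the example are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele a
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # For 1 df, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


chi2, p_value = hwe_chi_square(30, 40, 30)   # hypothetical counts with a heterozygote deficit
print(f"chi-square = {chi2:.2f}, P = {p_value:.4f}")
```

When many SNPs are tested at a 0.05 threshold, a few nominal failures are expected by chance alone, which is the reasoning behind treating the two P > 0.03 results as unremarkable.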
An LD chart with individual R² values for all of the SNPs we genotyped in 2,458 individuals with European ancestry with MAF > 5% is shown in supplementary Fig. I. This includes 56 SNPs across 50 kb. Using Haploview, seven LD blocks are identified that span 32 kb and include 50 SNPs (Fig. 2). Over the same genomic region, 59 SNPs genotyped in HapMap have MAF > 5% in the 90 person CEU population. Thirty-one of these SNPs are identical to those we genotyped. With the HapMap SNPs, six LD blocks are identified with Haploview that cover 25 kb and include 43 SNPs. Although the blocks in the CEU and ACCESS populations align well for the most part, there are differences in both the number of blocks and their boundaries, primarily in the 5′ region of the gene. Because of the much larger ACCESS population, many more SNPs are incorporated into the LD blocks, including three that were genotyped by HapMap but not placed in LD blocks. Thus, the large number of individuals genotyped allows many more SNPs to be placed in LD blocks, but the relevance of these blocks to the detailed genomic structure is also much more apparent, as perhaps best visualized by the "checkerboard" pattern of LD in block 7, 3 to 8 kb downstream of the CETP gene. In addition, most of the weak LD interactions that appear in the HapMap population disappear with the much larger ACCESS population.

[Fig. 1 legend, opening text missing] ... and African (column 5) ancestry. P values for association with high density lipoprotein-cholesterol (HDL-C) generated using a generalized least-squares method and adjusting for covariates but not for multiple testing are provided in columns 6, 8, and 10, with the smallest P values listed as <0.00001 even though some are <10⁻¹⁰. False-discovery rates generated from a permutation analysis that better corrects for multiple testing, linkage disequilibrium (LD) structure, and violation of modeling assumptions are provided in columns 7, 9, and 11. Because only 5,000 permutations were done, the smallest P values are listed as <0.001, but these could be substantially smaller if more permutations were attempted. Of note, the permutation test also revealed the conservative nature of the underlying model, as many uncorrected point estimates for P values were more significant in the permutation test than in the original model (particularly among Caucasians); this is the reason why many of the adjusted false-discovery rates are actually lower than the unadjusted P values. In columns 6-11, <x means the P value is >x/10 but not >x, unless x is one of the minimum listed values cited above, or unless x = 0.05, in which case the value is >0.01 but not >0.05. All SNPs were tested for association. In column 12, the exon/intron positions are provided. In column 13, the position relative to the start and end of transcription and SNPs that are located within the coding sequence is listed. The nucleotide position on chromosome 16 in Build 35 is listed in column 14. Columns 15-17 provide allele frequencies in the ACCESS trial.

LD bins
Initial determination of LD bins was carried out to select SNPs for genotyping, but these bins were redefined after the complete set of genotypes was obtained. Numbering of LD bins is arbitrary. We have used the same numbers across ethnicities where possible, but it is clear that the boundaries for these bins are not the same across ethnic groups. As noted previously by Hinds et al. (13), the LD bins are highly noncolinear, with significant interdigitation of SNPs in different bins.
When we define LD bins using our data, the SNPs grouped together are very consistent within an ethnic group compared with those generated by Hinds et al. (13) or using the same definitions with HapMap data, even though both sets had far fewer individuals. The only discrepancies found with either data set are with SNPs that are very close to either the MAF or R² cutoff in one population or the other. Otherwise, there is perfect agreement for LD bin composition.
Using the cutoffs of MAF > 5% and R² > 0.8 for LD bin determination and all available data, 102 SNPs can be placed in 49 bins for individuals of European ancestry. The most populated bin contains eight SNPs, and the largest span covered by a single bin is 10,731 bp. For HapMap samples, 40 tagging SNPs representing 59 SNPs are identified. Nearly identical bins are generated with only two HapMap tagging SNPs falling into the same bin as defined here. When the additional 43 SNPs not found in HapMap are added, this required only an additional nine bins, confirming that there is a point of diminishing returns for genotyping but that genotyping at a frequency of greater than one SNP per kilobase still has the potential to generate useful data.
For individuals of African ancestry, there are 97 SNPs in 66 bins. The most populated bin contains seven SNPs, and the largest spans 6,621 bp. The larger number of bins and shorter extent of DNA sequence covered by each bin in those of African-American versus European ancestry are similar to what has been observed previously (13). Our bins match well with the African Americans characterized by Hinds et al. (13) but poorly with the Nigerian HapMap data. This highlights the difficulties of comparing across populations and ensuring that ancestry is matched appropriately. We have not attempted a comparison of our Asian ancestry bins with others because of the small number of individuals genotyped.
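To make the bin definition concrete, the following minimal sketch groups SNPs using the MAF > 5% and R² > 0.8 cutoffs. Several points are assumptions rather than the published procedure: R² is computed as the squared correlation of genotype dosages (0, 1, 2) instead of from phased haplotypes, bins are formed greedily around the SNP with the most partners, and the toy genotypes and SNP names are invented for illustration.

```python
def minor_allele_frequency(dosages):
    """MAF from per-individual allele counts (0, 1, or 2)."""
    p = sum(dosages) / (2 * len(dosages))
    return min(p, 1.0 - p)


def r_squared(x, y):
    """Squared Pearson correlation between two dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def greedy_ld_bins(genotypes, maf_min=0.05, r2_min=0.8):
    """Group SNPs into bins; each bin is a tag SNP plus all unbinned SNPs with R² > r2_min to it."""
    snps = [s for s, g in genotypes.items() if minor_allele_frequency(g) > maf_min]
    partners = {
        s: {t for t in snps if t != s and r_squared(genotypes[s], genotypes[t]) > r2_min}
        for s in snps
    }
    unbinned = set(snps)
    bins = []
    while unbinned:
        # Pick the SNP that captures the most remaining SNPs (ties: alphabetical).
        tag = max(sorted(unbinned), key=lambda s: len(partners[s] & unbinned))
        members = {tag} | (partners[tag] & unbinned)
        bins.append(sorted(members))
        unbinned -= members
    return bins


toy = {
    "snp1": [0, 1, 2, 0, 1, 2, 0, 1],
    "snp2": [2, 0, 0, 1, 2, 1, 0, 0],   # lies between snp1 and snp3 but is unlinked
    "snp3": [0, 1, 2, 0, 1, 2, 0, 1],
    "snp4_rare": [0, 0, 0, 0, 0, 0, 0, 0],
}
print(greedy_ld_bins(toy))
```

In the toy data, snp1 and snp3 share a bin even though snp2 sits between them, which mirrors the interspersed, noncolinear arrangement of bins described in the text.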

Associations with HDL-C
In addition to characterizing the genomic structure surrounding CETP, we also wanted to determine how the association with HDL-C was superimposed on that structure. Among individuals of European ancestry, the strongest associations are clearly in the promoter but span a very broad region. We have extended coverage to within 3 kb of the neighboring upstream gene, HERPUD1, 15 kb from the CETP transcriptional start. Our finding that a SNP >10 kb away (rs9989419) from the start site is associated with HDL-C suggests that distal interactions may play a role in regulating CETP levels. However, much of this association appears to arise from LD with nearby SNPs. All of the SNPs most highly associated with HDL-C among individuals of European ancestry are in LD bin 8. LD bins 6 and 10 are interspersed in this part of the promoter region but are orders of magnitude less significantly associated with HDL-C based on point estimates (Fig. 1). Several singleton SNPs in the promoter are also associated with HDL-C, but not as strongly as those in bin 8. One of the most highly associated SNPs, rs183130 in bin 8, has a consistent effect across ethnic groups, as shown in Fig. 3. The sequence conservation for this region among primates and the sequence surrounding two other functional SNPs is shown in Fig. 4. In each region, the chimpanzee sequence is identical to the human sequence, whereas other species have up to several changes. For each of the putative transcription factor binding sites, some of these nonconserved positions would be predicted to affect protein binding.
At the other end of the gene, the associations with HDL-C are not as strong as observed in the promoter region. Several nonsynonymous SNPs in this 3′ region have been shown to be functional, with effects on CETP activity or stability. There may be functional SNPs in addition to the nonsynonymous SNPs. rs289748, which is >7,000 bp from the end of transcription, is associated with HDL-C in individuals of both European (P < 0.001) and African (P < 0.02) ancestry. This SNP is not tightly linked with other SNPs tested in either group. The functional source of this association is unknown.
All associations were tested for gender effects as well. Two SNPs in individuals of European ancestry and one SNP in individuals of African-American ancestry yielded gender-genotype interactions with P values between 0.03 and 0.05, not significant after correction for multiple testing. Interactions between SNPs were also tested. Because of the number of SNPs and comparisons involved, this was done only for individuals of European ancestry in the ACCESS trial. Four pairs of SNPs show strong, nonadditive interactions, even after correction for multiple testing. The most significant interaction, between rs12920974 and rs4783961, has an uncorrected P value of 1.7 × 10⁻⁸, remaining significant even after correction for 2,616 tests. Both of these SNPs are in the promoter region. The HDL-C means for each combination of genotypes are shown in Fig. 5. The other SNP pairs that remain significant after multiple testing correction all include rs7203286 (in the distal promoter region) in combination with rs820299 (intron 2), rs158477 (intron 9), or rs4783961 (promoter).
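As a quick arithmetic check on the strongest interaction, the sketch below applies a Bonferroni correction to the reported uncorrected P value. The text does not specify which correction was used for the pairwise tests, so Bonferroni is an assumption here, chosen because it is the most conservative common option.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted P value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Values from the text: uncorrected P = 1.7e-8 across 2,616 pairwise tests.
adjusted = bonferroni(1.7e-8, 2616)
print(f"adjusted P = {adjusted:.3g}")   # about 4.4e-05, still well below 0.05
```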

DISCUSSION
Low HDL-C is known to be a major risk factor for cardiovascular disease (reviewed in Ref. 22). HDL-C is affected by a variety of environmental factors such as alcohol intake, estrogen administration (23), and exercise (24) as well as by a host of genetic factors. The effect of CETP variation on HDL-C is robust (2) but can be obscured when small samples or particular SNPs are examined. For example, of the SNPs we examined, there are five within the CETP gene and one in the proximal promoter with minor allele frequencies of >10% that are not associated with HDL-C (P > 0.05) among the >2,400 individuals of European ancestry. If one were to look exclusively at these SNPs, one would mistakenly conclude that CETP is not associated with HDL-C levels.

Fig. 3. The plot depicts the ordered genotype of rs183130 for each ancestry on the horizontal axis [labeled as ancestry genotype, where the genotype is named A, B, or C for homozygotes in the most common allele (among pooled ancestries), heterozygotes, or homozygotes in the less common allele, respectively]. The vertical axis represents the values of log(HDL) after adjusting for nongenotypic covariates in the model. The boxes in the plot are bounded above and below by the 75th and 25th percentiles, respectively (the quartiles), and the error bars/whiskers extend an additional 1.5 times the interquartile range from either boundary of the box. Points outside this extension are plotted as outliers. The horizontal line inside the box denotes the median. The deviation from mean HDL-C for each of the rs183130 genotypes is shown for the three ethnicities examined.

All common CETP variants have, at most, modest effects on either CETP mass or activity. If the most significantly associated SNP, rs183130, is examined, the mean HDL-C for the common versus rare homozygote varies only from 46.3 to 49.7 mg/dl among individuals of European ancestry, a difference of <10%. In contrast, both gender and alcohol consumption have larger effects on HDL-C, with European females in the ACCESS trial having significantly higher HDL-C (54 mg/dl) than European males (44 mg/dl). Similarly, European males who consume >10 drinks per week have higher HDL-C (50.7 mg/dl) than those who consume none (41.9 mg/dl). Only rare, nonfunctional CETP variants have a large impact on HDL-C, and these also have a protective effect with respect to disease in large, prospective studies (25). Thus, the impact of common CETP SNPs can be readily observed on a population basis, but these are of little value when examining small numbers of individuals.
It is possible to overinterpret the effect of CETP on HDL-C and attribute any large change in HDL-C to CETP variants of modest functional significance, even when known environmental effects such as exercise are present (26). Even the high HDL-C induced by extreme exercise, such as running marathons (27), is not always protective for cardiovascular disease. Risk of sudden cardiac death is generally attributable to a variety of defects in cardiac structural and channel proteins, with >50% of such deaths attributed to hypertrophic cardiomyopathy (28), independent of CETP genotype.

Fig. 4. The sequence for 25 bp on either side of the three SNPs in transcription factor consensus sites is shown for humans and up to five other primates. Dusky titi is not shown for rs183130 because the sequence conservation was too low in that region. Each SNP is shown in reverse color, and putative transcription factor binding sites are shown above each sequence. The chimpanzee sequence was obtained from the University of California, Santa Cruz website (http://www.genome.ucsc.edu/cgi-bin/hgGateway?org=Chimp&db=panTro2), and all other sequences were obtained from the Lawrence Berkeley Laboratory website (http://pga.lbl.gov/cgi-bin/get_gene?id=131).

When data for only a limited number of SNPs were available, it was most convenient to describe the genomic structure of CETP and other genes as a series of large haploblocks. Initial work was consistent in showing that a large number of SNPs in the promoter and 5′ region of the gene were in LD with each other, whereas another set of SNPs in the 3′ region constituted another haploblock (6,10,12). This simplified view of the gene was useful as a rough approximation but is not accurate when comparing SNPs supposedly in the same haploblock but really having little linkage. With much more data now available, it is clear that the more detailed approach of using LD bins or tagging SNPs from nonlinear parts of the genome is necessary for an accurate view. Although there are many approaches and definitions that can be used for defining LD bins and tagging SNPs, similar results are obtained across populations with the same ancestry. The bins defined by Hinds et al. (13) with only 24 individuals were nearly identical to those defined by us with >2,000 individuals.
Horne et al. (21) identified a set of tagging SNPs for CETP that overlap the SNPs examined here. Despite using a very different approach, many SNPs represented by their tagging SNPs fall into separate bins, as defined here. The overlap is not perfect, but the same overall picture of the genomic structure is generated. However, comparison of our data with this set of tagging SNPs also highlights the need to characterize extensive regions of DNA sequence for a given gene. Their most distal SNP examined was only 631 bp from the start of transcription; thus, the LD bins with the most significant associations with HDL-C were not tagged.
The promoter SNP at -629 has been shown to affect Sp1 binding in vitro, reporter activity in cells, and CETP mass levels in humans (7,10,11). In contrast, the TaqIB SNP (rs708272) is among the most extensively studied, but there has been no indication that it exerts any functional effect. Although TaqIB is not in the same LD bin as the -629 SNP, it is in reasonably high LD (R² = 0.72) in individuals of European ancestry. Furthermore, TaqIB is in the same LD bin as other SNPs that span a large region of the gene, including SNPs in intron 2 (R² = 0.85), intron 5 (R² = 0.81), and intron 7 (R² = 0.91) in individuals of both European and Asian ancestry. Any of these SNPs could potentially have some functional effect, or it could arise from some other uncharacterized SNP in this 9-kb region. The TaqIB SNP is also associated with HDL-C in African Americans, but it is a singleton in terms of LD bins. Unlike individuals of European and Asian ancestry, in whom the most strongly linked SNPs are 3′ to TaqIB, the SNPs most strongly linked to TaqIB among individuals of African ancestry are in the promoter, with rs183130 (R² = 0.68) and the variable number of tandem repeats (R² = 0.64) being in tightest LD. This suggests that the functional SNP(s) assessed when TaqIB was examined in Africans is different from the functional SNP(s) assessed in Asians and Europeans.
Even though the promoter -629 SNP has been shown to have functional effects, it is clear that other SNPs in the promoter are also independently associated with HDL-C. The most highly associated SNPs are those in LD bin 8. However, among individuals of European ancestry, there are seven SNPs spread over 6,500 bp in bin 8, six of which we genotyped. Each could be examined on its own for functionality, but that is a challenging and not always fruitful endeavor. The availability of results from multiple ethnic groups makes it possible to decrease the number of potentially functional SNPs by taking advantage of the different LD structures. All six of the bin 8 SNPs we examined in Europeans are highly associated with HDL-C. Among Asians, all five of the SNPs in the homologous bin are also associated with HDL-C. Among Africans, the seven SNPs from bin 8 in Europeans are split into four separate bins, and only one of them is associated with HDL-C, rs183130 at position -4,502. Although it is possible that there are distinct functional SNPs in the different ethnic groups, the consistent results (Fig. 3) with this SNP suggest that rs183130 may be functional. Additional experiments will be necessary to confirm this.
When 11 promoter SNPs that are associated with HDL-C among individuals of European ancestry are scanned for transcription factor binding sites (29), the only one that results in a change is rs183130. When a G is present on the bottom strand (GGGATTCTCC), an 8:10 match to the consensus site for nuclear factor κB (GGGGYNNCCY) described by Ghosh, May, and Kopp (30) and a 10:10 match to the consensus site (GGGRDTYYCC) described by Liu et al. (31) are found. The alteration from G to A creates a mismatch in both consensus sequences. The Liu et al. (31) consensus is particularly interesting in that it also appears to bind members of the Sp1 family of proteins that have been shown to be important in regulating CETP at the proximal promoter SNPs at -629 and -38 (7,11).
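The match counts quoted above can be checked with a short sketch that scores the 10-bp site against each consensus using standard IUPAC ambiguity codes. The text does not say which G in the site is the polymorphic base, so the example substitutes A at each G in turn; treating every G as a candidate is an assumption made only for illustration.

```python
# IUPAC nucleotide codes needed for the two consensus sequences.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "D": "AGT", "N": "ACGT",
}


def count_matches(site, consensus):
    """Number of positions at which the site base is allowed by the consensus code."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have equal length")
    return sum(base in IUPAC[code] for base, code in zip(site, consensus))


SITE_WITH_G = "GGGATTCTCC"
GHOSH_CONSENSUS = "GGGGYNNCCY"
LIU_CONSENSUS = "GGGRDTYYCC"

print(count_matches(SITE_WITH_G, GHOSH_CONSENSUS))  # 8 of 10
print(count_matches(SITE_WITH_G, LIU_CONSENSUS))    # 10 of 10

# Substitute A for each G in turn and rescore against both consensus sites.
for i, base in enumerate(SITE_WITH_G):
    if base == "G":
        variant = SITE_WITH_G[:i] + "A" + SITE_WITH_G[i + 1:]
        print(variant,
              count_matches(variant, GHOSH_CONSENSUS),
              count_matches(variant, LIU_CONSENSUS))
```

Each G in the site aligns with an invariant G in both consensus sequences, so a G-to-A change at any of them lowers both scores by one, in line with the statement that the alteration creates a mismatch in both.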
Large meta-analyses of some CETP SNPs have been published, and these show a consistent but variable association with HDL-C, CETP mass/activity, and other phenotypes (2,3). Results in studies with large numbers of individuals with European ancestry are in agreement with our results (supplementary Table I). Promoter SNPs rs12149545, rs4783961, and rs1800775 are significantly associated with HDL-C (6,8,32,33). Similarly, results with large numbers of individuals of Asian ancestry found rs3764261 and the VNTR highly associated with HDL-C (9), as we found. All of these studies have focused on the promoter sequence within 3,300 bp of the transcriptional start site. Most functional characterization of the promoter has been restricted to the proximal 3,000 bp (34), with only limited analysis beyond that region (35).
In addition to the univariate analyses, nonadditive interactions between SNPs may also be important, as seen with other lipid-related genes (36). Our observation that the association of some promoter SNPs (rs4783961 at -971) is nonadditive with other SNPs mirrors previous findings (33) in which that SNP's function in vitro was dependent on other nearby SNPs. The fragment tested in vitro was only 1,707 bp long, so it did not include the SNP we found most significant, rs12920974 at -2,940, but the concordance of results clearly shows the complexity of genetic modulation of transcription.
The data provided here yield a means of comparing results across many studies by determining which SNPs are likely to yield similar information and which will not. The extensive genotyping also allows other investigators to compare their populations with an independent set to test whether differences in allele frequencies found in a case-control study might be attributable to problems with the control population rather than a true association. As stated above, the SNPs tested here are all in Hardy-Weinberg equilibrium, unlike some control populations described elsewhere. By examining the CETP gene in detail and across populations, we are able to predict which SNPs are likely to be functional. This approach is generalizable to other genes in which robust associations are found.