Journal of Lipid Research, Vol. 47, 318-328, February 2006
Contribution of regulatory and structural variations in APOE to predicting dyslipidemia
* National Public Health Institute, Helsinki, Finland Published, JLR Papers in Press, November 29, 2005.
1 To whom correspondence should be addressed. e-mail:jari.stengard{at}ktl.fi (J.H.S.); csing{at}umich.edu (C.F.S.)
The objective of this study was to evaluate 1) whether noncoding single nucleotide polymorphisms (non-cSNPs) in the apolipoprotein E gene (APOE) identified by resequencing studies contribute to statistically explaining dyslipidemia if variations in the two coding SNPs (cSNPs) in exon 4 that define the ε2, ε3, and ε4 alleles are ignored, and 2) whether the contribution of these additional SNPs persists when variations in the cSNPs are considered. We used an ecological, multiple-population, data-mining strategy to identify single-SNP and two-SNP genotypes that distinguish between high and low levels of plasma lipids in three training samples: European-Americans from Rochester, MN, African-Americans from Jackson, MS, and Europeans from North Karelia, Finland. We found that a pair of SNPs located in the 5' region define genotypes A560T832/A560T832, A560T832/A560G832, and A560T832/T560T832, which distinguish between high and low levels of HDL-cholesterol (HDL-C), triglycerides (TG), and/or total cholesterol (T-C). The A560T832/- genotypes predicted high TG and high T-C in both genders in a large independent test sample from Copenhagen, Denmark. Prediction of high T-C in the Danish females was dependent on genotypes defined by the cSNPs. Our study suggests that both regulatory and structural variations should be considered when evaluating the utility of APOE for predicting dyslipidemia in the population at large.
Supplementary key words: apolipoprotein E gene; pleiotropy; data mining; regulation; lipids
Cholesterol accumulation in arterial walls is an important contributing factor in the development of atherosclerotic cardiovascular disease (CVD) (1). Information about the genetic basis of interindividual differences in lipid metabolism is thus expected to be useful in risk assessment, providing clues for the development of nonpharmacological and pharmacological interventions and suggesting population-based disease prevention strategies for CVD (2–5). A plethora of variations in genes involved in lipid metabolism have been characterized (6–10). The statistical evaluation of the contributions of these genomic variations to variation in measures of lipid metabolism and risk of CVD presents one of the most difficult challenges facing CVD research. The biological realities that interactions between gene variations and environmental variations are the primary causes of interindividual differences in lipid metabolism and risk of CVD, and that these interactions are dynamic over the lifetime of the individual, serve as major obstacles for the study of phenotype-genotype relationships (11). Studies of the influence of the variation in the gene coding for apolipoprotein E (APOE) on quantitative blood measures of lipid metabolism have demonstrated both context-dependent genotype effects (12–14) and phenotype-APOE genotype relationships that are less sensitive to contexts indexed by time and space (12, 14, 15). In the study reported here, we use data collected from three populations that are ethnically and geographically distinct to identify phenotype-APOE genotype relationships that are less sensitive to the influence of genetic and environmental contexts indexed by gender, ethnicity, and geographic location. The utility of the identified phenotype-genotype models is then tested in a sample from a large independent study of a fourth population.
We chose this strategy to identify phenotype-APOE genotype relationships that are expected to have the greatest utility in predicting dyslipidemia in the broadest range of contexts.
Apolipoprotein E (apoE) is a structural constituent of many atherogenic lipoprotein particles, such as triglyceride (TG)-rich chylomicrons and HDLs, and is involved in their transport from one tissue or cell type to another (16–18). It has three common isoforms, E2, E3, and E4 (19), which are encoded by three alleles, ε2, ε3, and ε4. We addressed the first of these two questions using an ecological (21), multiple-population, data-mining strategy to identify SNPs, or pairs of SNPs, of APOE that define genotypes that statistically distinguish between high and low levels of HDL-C, TG, and/or T-C subgroups in a sample of European-Americans from Rochester, MN. Because heterogeneity in the phenotype-genotype relationship across different populations is an important concern to those seeking context-independent predictors of the risk of disease (22, 23), we then selected only those SNPs, or pairs of SNPs, that define genotypes that distinguish between high and low concentrations of at least two of the three measures of lipid metabolism in both genders in at least one of the two other independent samples collected, in Jackson, MS, and North Karelia, Finland. Our specific questions here are as follows. 1) How many SNPs, and pairs of SNPs, satisfy the proposed selection criteria? 2) What are the locations of the selected SNPs? 3) What are the relative frequencies of the single-SNP alleles and the two-SNP haplotypes defined by selected pairs of SNPs? 4) What are the high-risk and/or low-risk genotypes defined by the selected single SNPs and pairs of SNPs? We then asked whether 1) the hypothesized high-risk genotypes predict low HDL-C, high TG, and/or high T-C and 2) the variations in the 3937 and 4075 cSNP positions are related to the observed discriminative abilities of the proposed phenotype-genotype models using a large population-based sample of Europeans from Copenhagen, Denmark.
We used the National Cholesterol Education Program Expert Panel's recommendations for defining dyslipidemic subgroups (24). Dyslipidemia was diagnosed when an individual's blood T-C concentration was >200 mg/dl, TG was >150 mg/dl, or HDL-C was <40 mg/dl. Our research strategy involved three steps: 1) SNP selection using three independent samples; 2) selection of phenotype-genotype models using the information obtained in the SNP selection procedure with these samples; and 3) evaluation of the utility of the selected models in a fourth independent test sample. In the first SNP selection step, we first used a sample from Rochester to identify single SNPs and pairs of SNPs that defined genotypes that significantly distinguished between high-risk and low-risk subgroups for at least two measures of lipid metabolism in both females and males. The Rochester sample included 854 unrelated individuals (456 females and 398 males) recruited by the Rochester Family Heart Study (25, 26). The participants in the Rochester sample were requested to fast for 12 h before examination. For the subset of single SNPs and pairs of SNPs that significantly discriminated between high and low concentrations of two or more traits in both genders in the Rochester sample, we next considered the replication of the selected SNP effects in the Jackson and North Karelia samples as a second criterion for SNP selection. The Jackson sample included 702 unrelated African-American individuals (483 females and 219 males) who were part of the ongoing Genetic Epidemiology of Atherosclerosis study (27). The North Karelia sample included 337 unrelated individuals (188 females and 149 males) who were ascertained by an ongoing prospective study, the population-based FINRISK study (28, 29). Each participant in the North Karelia sample was measured for three lipid phenotypes twice, once at the baseline survey in 1992 and then in 1995 in connection with a 3 year follow-up examination (28, 30). 
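The dichotomization of the three lipid traits described above can be written down in a few lines. The following is a minimal sketch that uses only the cut-points quoted in the text (T-C >200 mg/dl, TG >150 mg/dl, HDL-C <40 mg/dl); the class and function names are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass

# Cut-points quoted in the text, in mg/dl.
TC_HIGH = 200.0
TG_HIGH = 150.0
HDL_LOW = 40.0


@dataclass
class LipidProfile:
    """Hypothetical container for one participant's measurements (mg/dl)."""
    total_cholesterol: float
    triglycerides: float
    hdl_cholesterol: float


def classify(profile):
    """Return the three dichotomized traits for one participant."""
    return {
        "high_TC": profile.total_cholesterol > TC_HIGH,
        "high_TG": profile.triglycerides > TG_HIGH,
        "low_HDL": profile.hdl_cholesterol < HDL_LOW,
    }


def is_dyslipidemic(profile):
    """True when at least one of the three cut-points is exceeded."""
    return any(classify(profile).values())
```

The inequalities are strict, so a value exactly at a cut-point is not counted as dyslipidemic in this sketch.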
To minimize the misclassification of dyslipidemia, we considered only those individuals from North Karelia who had high or low HDL-C, high or low TG, and/or high or low T-C at both the baseline and follow-up surveys. The subset of SNPs, considered singly or in pairs, whose ability to distinguish between high and low concentrations of multiple measures of lipid metabolism replicated in males and females in at least one of these two additional samples was then taken as the final set of selected SNPs. The participants in the Jackson study were requested to fast for 12 h, and the participants of the North Karelia study were requested to fast for 4 h, before examination. In the second step, we identified the haplotypes and genotypes defined by the selected SNPs that were responsible for the observed statistically significant phenotype-genotype associations observed in the first step and whose effects were replicated in both genders in at least one of the two other independent samples collected in Jackson and North Karelia. In the third and final step, we tested the utility of phenotype-genotype models established in step 2 for predicting low HDL-C, high TG, or high T-C in large population-based samples of females and males collected in Copenhagen. The Danish sample included 9,011 unrelated, native-born, non-Hispanic European individuals (4,947 females and 4,064 males) ascertained without regard to health status in connection with the third examination of the Copenhagen City Heart Study (31, 32). The participants in the Danish study were not requested to fast before examination. All participants in the Rochester, Jackson, and North Karelia samples gave informed consent, and the Copenhagen City Heart Study was approved by the Danish Ethics Committee for Copenhagen and Frederiksberg (No. 100.2039/91). Blood HDL-C, TG, and T-C concentrations for the Rochester and Jackson samples were measured at the Mayo Clinic (Rochester, MN) using published methods (20, 33, 34). 
The Finnish and Danish samples were measured by standard enzymatic assays (Boehringer Mannheim GmbH Diagnostics, Mannheim, Germany) at the Department of Biochemistry, National Public Health Institute, in Helsinki (28, 35) and at the Department of Clinical Biochemistry, Rigshospitalet, Copenhagen University Hospital (32), respectively. The methods used to genotype the APOE SNPs have been described by Nickerson et al. (9) for the Rochester, Jackson, and North Karelia samples and by Frikke-Schmidt et al. (32) for the Danish sample. The relative frequencies of two-site haplotypes for each population were estimated using an E-M algorithm (36). In the first SNP selection step, we used the combinatorial partitioning method (CPM) (37) as a data-mining tool to evaluate the ability of genetic variations defined by one- and two-SNP genotypes to distinguish between high and low concentrations of HDL-C, TG, and T-C in the female and male Rochester samples. This method was developed to identify partitions of genotypes that statistically explain interindividual variation in quantitative trait levels. We modified the CPM for this study to identify partitions of single- and two-SNP genotypes that statistically distinguish dichotomized trait levels. In this modified strategy, we first estimated the prevalence of the trait of interest (e.g., low blood HDL-C concentration) for each genotype in the set of genotypes defined by a particular SNP or pair of SNPs. The genotypes were then ranked according to their prevalence estimates. The ranked genotypes were partitioned into groups, and the prevalence was reestimated for each partition. The utility of each set of partitions for distinguishing between high and low trait levels was evaluated using the contingency Chi-square statistic. 
For each SNP and each pair of SNPs, this strategy selects the set of partitions that maximized similarities of the prevalences associated with genotypes within partitions and minimized similarities of the prevalences assigned to different partitions of genotypes. At present, there is no formal, widely accepted, statistical strategy for distinguishing statistically significant results from a single study that are a consequence of "true" biological effects from those that are type I errors (11). Hence, we used an ad hoc strategy to minimize the possibility that the significant result of a particular CPM analysis is a type I statistical error by selecting only those SNPs, or pairs of SNPs, that define genotypes that distinguish between high and low blood concentrations of at least two measures of lipid metabolism in both females and males, first in the Rochester sample, and subsequently in both female and male samples from Jackson or North Karelia or from both samples. We next used a second data-mining strategy to identify the single-SNP and/or two-SNP genotype(s) that are most likely responsible for the statistically significant phenotype-genotype associations in the Rochester, Jackson, and North Karelia samples. This involved identifying those genotypes that have a higher prevalence of the trait of interest (e.g., low HDL-C) than the overall prevalence in the gender/population sample being considered. Again, we selected only those genotypes whose higher ranking was consistent across at least five of the six gender/population samples.
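The rank-then-partition idea behind the modified CPM can be illustrated with a simplified sketch. It is not the authors' implementation: it assumes that genotypes are split into exactly two groups, that only contiguous cuts of the prevalence-ranked list are tried, and that the Pearson chi-square statistic without continuity correction is used. The function names and the input layout are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: no evidence of association
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def best_two_group_partition(counts):
    """Find the two-group split of ranked genotypes with the largest chi-square.

    counts maps genotype -> (n_affected, n_total); every n_total must be > 0.
    Returns (statistic, low_prevalence_genotypes, high_prevalence_genotypes).
    """
    # Step 1: rank genotypes by their estimated prevalence of the trait.
    ranked = sorted(counts, key=lambda g: counts[g][0] / counts[g][1])
    best = None
    # Step 2: try every contiguous cut of the ranked list (an assumption).
    for cut in range(1, len(ranked)):
        low, high = ranked[:cut], ranked[cut:]
        a = sum(counts[g][0] for g in low)                  # affected, low group
        b = sum(counts[g][1] - counts[g][0] for g in low)   # unaffected, low group
        c = sum(counts[g][0] for g in high)                 # affected, high group
        d = sum(counts[g][1] - counts[g][0] for g in high)  # unaffected, high group
        # Step 3: score the partition with the contingency chi-square.
        stat = chi_square_2x2(a, b, c, d)
        if best is None or stat > best[0]:
            best = (stat, low, high)
    return best
```

With made-up counts such as `{"AA": (10, 100), "AT": (12, 100), "TT": (30, 100)}`, the sketch groups AA with AT and sets TT apart, because that cut gives the larger statistic.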
Finally, the utility of the phenotype-genotype models obtained in the two data-mining steps for predicting dyslipidemia was evaluated in the Danish sample using conventional logistic regression analysis (38). Unless noted otherwise, we considered a nominal
Description of the Rochester sample
Gender-specific means and variances of age, basic anthropometric characteristics, and the three blood measures of lipid metabolism, HDL-C, TG, and T-C, are given in Table 1. The average age of the female and male samples was similar (48 years), but the variability in age was significantly greater in females. On average, females were significantly leaner, and they were less frequently dyslipidemic (20, 19, and 39% for low HDL-C, high TG, and high T-C, respectively) than males (56, 35, and 47%, respectively). The estimates of interindividual variance of body mass index were significantly greater in females than in males.
Utility of single-SNP genotype variations for distinguishing between high and low HDL-C, TG, and/or T-C in the Rochester sample
The tests of associations between lipid traits and single-SNP genotype variations are summarized in the diagonal cells of Fig. 1, separately for females and males. Only 1 of the 10 SNPs (5361) defined a single-SNP genotypic variation that distinguished between high and low concentrations of more than one blood measure of lipid metabolism in either gender.
Utility of two-SNP genotype variations for distinguishing between high and low HDL-C, TG, and T-C in the Rochester samples
The tests of associations between lipid traits and two-SNP genotype variations are summarized in the off-diagonal cells of Fig. 1, separately for females and males. Twelve pairs of SNPs in females (26%; denoted by red, blue, green, or purple in Fig. 1) and 23 pairs in males (51%; also denoted by red, blue, green, or purple in Fig. 1) defined two-SNP genotype variations that distinguished between high and low concentrations of more than one lipid trait. Of these pairs, only five (560-832, 560-4075, 624-5361, 2440-5361, and 4075-5361; denoted by black boxes in Fig. 1) distinguished between high and low trait concentrations in both genders. For each of these pairs, we next considered the replication of the phenotype-genotype association in the Jackson and North Karelia samples as a second criterion for SNP selection.
Utility of selected two-SNP genotype variations for distinguishing between high and low HDL-C, TG, and T-C in the Jackson and North Karelia samples
Relative frequencies of the two-SNP haplotypes defined by variations in the non-cSNPs at positions 560 and 832
Adenine (A560) and guanine (G832) are the most common nucleotides at the 560 and 832 sites, respectively, in all three samples (Table 3). Estimates of the relative frequencies of these alleles, however, were heterogeneous among the three populations. The relative frequency of the A560 allele was 20% lower, and that of the G832 allele 150% higher, in the Jackson sample than in the Rochester and North Karelia samples. The A560 and G832 alleles define the most common two-SNP haplotype in all three populations. The A560 allele together with thymine at the 832 position (T832) defines the second most common haplotype in the Rochester and North Karelia samples, whereas in the Jackson sample this haplotype was the least common. The T560 and G832 alleles define the second most common two-site haplotype in the Jackson sample.
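Two-site haplotype frequencies such as those discussed above have to be inferred from unphased genotypes, and the text states only that an E-M algorithm (36) was used. As a minimal sketch of how such an estimate can be obtained for two biallelic SNPs, the function below treats double heterozygotes as the only phase-ambiguous genotype. The genotype coding (copies of a reference allele, 0 to 2, at each site) and all names are assumptions made for illustration, not the authors' implementation.

```python
def em_two_snp_haplotypes(geno_counts, n_iter=200, tol=1e-10):
    """Estimate two-SNP haplotype frequencies by expectation-maximization.

    geno_counts maps (i, j) -> number of individuals carrying i copies of the
    coded allele at SNP 1 and j copies at SNP 2 (i, j in 0..2).
    Returns a dict mapping haplotype (x, y), x and y in {0, 1}, to its frequency.
    """
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    base = {h: 0.0 for h in haps}       # haplotype counts with known phase
    n_double_het = geno_counts.get((1, 1), 0)
    n_people = sum(geno_counts.values())
    for (i, j), n in geno_counts.items():
        if (i, j) == (1, 1):
            continue                    # phase unknown, handled in the E-step
        first = (1 if i == 2 else 0, 1 if j == 2 else 0)
        second = (1 if i >= 1 else 0, 1 if j >= 1 else 0)
        base[first] += n
        base[second] += n

    freqs = {h: 0.25 for h in haps}     # uninformative starting values
    for _ in range(n_iter):
        # E-step: split double heterozygotes between the two possible phases.
        cis = freqs[(0, 0)] * freqs[(1, 1)]
        trans = freqs[(0, 1)] * freqs[(1, 0)]
        w = cis / (cis + trans) if (cis + trans) > 0 else 0.5
        counts = dict(base)
        counts[(0, 0)] += n_double_het * w
        counts[(1, 1)] += n_double_het * w
        counts[(0, 1)] += n_double_het * (1 - w)
        counts[(1, 0)] += n_double_het * (1 - w)
        # M-step: re-estimate frequencies from the expected counts.
        new = {h: counts[h] / (2 * n_people) for h in haps}
        converged = max(abs(new[h] - freqs[h]) for h in haps) < tol
        freqs = new
        if converged:
            break
    return freqs
```

When no double heterozygotes are present, every haplotype is observed directly and the estimate reduces to simple counting.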
Identification of the most informative two-SNP genotypes defined by SNPs at positions 560 and 832
Prevalence estimates of low HDL-C, high TG, and high T-C in each of the six gender/population samples are denoted by red lines in Fig. 2A, B, C, respectively. These estimates ranged between 555:1,000 and 29:1,000 for low HDL-C (Fig. 2A), between 434:1,000 and 189:1,000 for high TG (Fig. 2B), and between 881:1,000 and 388:1,000 for high T-C (Fig. 2C). The test of heterogeneity of the prevalences among the six gender/population samples was statistically significant at P < 0.001 for each of the three lipid traits.
Prevalences of low HDL-C, high TG, and high T-C for each of the observed two-SNP genotypes defined by the 560-832 pair of SNPs are given in Fig. 2A, B, C, respectively, separately for each of the six gender/population samples. Prevalences of high and low lipid concentrations in subsamples of carriers of the T560T832 and T560G832 haplotypes tended to deviate more from the prevalences of the respective gender/population samples than did prevalences in subsamples of individuals who were either homozygous or heterozygous for the two common haplotypes A560T832 and A560G832. Rankings of genotype-specific prevalences vary from one lipid trait to another within a particular gender/population sample, as well as from one gender/population sample to another for a particular lipid trait. There are exceptions, however. The prevalence of low HDL-C in the subsample of A560T832/A560T832 homozygous individuals was higher than the sample prevalence in five of the six gender/population samples. Furthermore, the prevalence of high T-C in this subsample of homozygotes was higher than the sample prevalence in all six gender/population samples. Using a sign test (39), the probability of obtaining the observed ranking of the A560T832/A560T832 genotype with respect to the prevalence in each of the gender/population samples, assuming that there is no association between this genotype and prevalence, is 0.109 for five of six rankings and 0.035 for six of six rankings. The prevalences of high TG in the subsamples of A560T832/A560G832 and A560T832/T560T832 heterozygous individuals were lower than the sample prevalence in five of six gender/population samples, whereas the prevalence of high T-C in subsamples of A560T832/A560G832 heterozygous individuals was higher than the sample prevalence in five of the six gender/population samples.
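The sign-test probabilities quoted above can be checked against a plain binomial tail. The sketch below assumes a one-sided test with a null probability of 0.5 that a genotype-specific prevalence falls above the sample prevalence; the function name is hypothetical. Under that assumption, five or more of six gives 7/64, about 0.109, in agreement with the text. Six of six gives 1/64, about 0.016, which differs from the 0.035 quoted, so the authors' exact computation may not be the one sketched here.

```python
from math import comb


def sign_test_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided sign-test probability."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


five_of_six = sign_test_upper_tail(5, 6)  # 7/64
six_of_six = sign_test_upper_tail(6, 6)   # 1/64
```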
In summary, we conclude from the analyses of the Rochester, Jackson, and North Karelia samples that the A560T832 haplotype-containing genotypes are the most informative predictors of dyslipidemia. Individuals who are homozygous for the A560T832 haplotype have an increased risk of low HDL-C that is consistent among samples that differ in gender, ethnicity, and geographic location. A subsample of A560T832/A560T832 homozygous and A560T832/A560G832 and A560T832/T560T832 heterozygous individuals (denoted as A560T832/-) have a decreased risk of high TG but an increased risk of high T-C. We next tested the utility of these recessive and dominant genetic models in distinguishing between low HDL-C and high TG and T-C, respectively, using data from large population-based samples of females and males collected in Copenhagen.
A test of the utility of the selected two-SNP genotypes in predicting dyslipidemia in large population-based samples of females and males from Copenhagen
Genotypes defined by the two cSNPs were statistically significant predictors of high T-C and high TG in both genders and of low HDL-C in females only. There was no evidence of a statistically significant interaction between the effects of the group of A560T832/- genotypes and the effects of genotypes defined by variations in the two cSNPs 3937 and 4075 in predicting low HDL-C and high TG (Table 4). The ORs for low HDL-C and high TG when the two cSNPs are ignored were in the same range as the adjusted ORs estimated when the two cSNPs are included in the prediction model. There was a statistically significant interaction between the effect of the group of A560T832/- genotypes and the genotypes defined by variations in the two cSNPs in the prediction of high T-C in females (Tables 4, 5). The estimated OR for high T-C is significantly higher (1.24; 95% confidence interval = 1.00–1.53) for the ε4 allele-carrying females in the A560T832/- genotypes group and significantly lower (0.78; 95% confidence interval = 0.64–0.94) for the ε3/3 group of females compared with females with the ε3/3 genotype who did not have the A560T832/- genotypes. The group with the A560T832/- genotypes was not identified as a statistically significant predictor of high T-C in males when variations in the two exon 4 cSNPs were included in the prediction model.
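The ORs and confidence intervals reported above come from logistic regression models that adjust for other terms. For a single binary predictor with no covariates, the logistic-regression OR equals the cross-product ratio of a 2x2 table, and that simpler case is sketched here with a Wald interval on the log scale. This is an illustration of how to read an OR with its 95% confidence interval, not a reproduction of the adjusted analysis; the function name and the counts in the usage note are hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval from a 2x2 table.

    a: carriers with the trait      b: carriers without the trait
    c: non-carriers with the trait  d: non-carriers without the trait
    All four counts must be positive. z=1.96 gives an approximate 95% interval.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

For example, with invented counts `odds_ratio_wald_ci(20, 80, 10, 90)` the point estimate is 2.25, but the interval includes 1, so such a result would not be called statistically significant at the 5% level.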
An alternative research strategy
A commonly used strategy for identifying genetic variations that are predictors of phenotypic variation is to collect a large representative sample from a particular population, use statistical summaries to test phenotype-genotype hypotheses, and turn to Baconian induction to infer the generality of genetic effects (40–42). An integral part of such a strategy is that a statistically significant, empirically derived hypothesis must survive further testing in other studies of other samples to become a universal "truth" (42). The expectation is that the surviving hypothesis can then be used to predict future events in any population (43). Genetic analyses of phenotypes that have a complex multifactorial etiology, such as dyslipidemia, challenge this induction/deduction paradigm because it ignores the possibility that the hypothesis generated is dependent on the context of the population studied. The predictions of the proposed hypothesis simply may not survive further testing because of the heterogeneity of the phenotype-genotype relationship among populations or the lack of statistical power associated with small samples. As likely is the possibility that it may not survive further testing in any population because a hypothesis derived from the study of only one population may be a type I error (11). We suggest here an alternative strategy to this induction/deduction paradigm that reduces the possibility that the initial hypothesis is a type I error by applying an ecological data-mining strategy to samples collected from multiple populations to generate a hypothesis that is expected to be less sensitive to context. This multiple-population data-mining strategy sorts out those hypothesized phenotype-genotype relationships that are less likely to be type I errors and more likely to be of utility in unstudied populations that differ for genetic and environmental contexts indexed by gender, ethnicity, and geographic locations.
Although this strategy increases the likelihood that a particular genetic variation may have utility in predicting phenotypic variation in an unstudied population, we emphasize that the predictive utility realized in independent samples of Danish females and males must be reevaluated anew in subsequent populations of interest because of the anticipated role of context dependence in the etiology of measures of lipid metabolism. We discuss below 1) the limitations of this research strategy for modeling the genetic architecture of measures of lipid metabolism; 2) the relationship between phenotypic variation in lipid traits and variation in APOE identified by this strategy; 3) how the proposed phenotype-genotype model reflects current knowledge about the biology of APOE and lipid metabolism at the cellular level; and 4) how this phenotype-genotype model can be used in medical practice and/or public health programs in a particular population of interest.
Limitations of the research strategy for characterizing the genetic architecture of lipid and lipoprotein traits
There are several shortcomings of the ecological data-mining strategy for modeling the biological relationships between phenotype and genotype. Statistical models that have general applicability across populations cannot be expected to capture the biological complexities of the connections known to be involved. The role of population-specific gene-gene and gene-environment interactions and population-specific age-dependent exposures to specific environmental agents can only be studied on a population-by-population basis. More importantly, in common with all association studies, most single-gene effects on the phenotype of interest in a particular population cannot be estimated because 1) they are too small to measure; 2) they cannot be accurately estimated; 3) they are confounded with the effects of unmeasured genetic and/or environmental agents (44) and/or even chance (45); 4) the effects are inseparable from the effects of closely linked gene variations; and 5) the complexities of the cause-and-effect connections through the intermediate pathways to the phenotype result in no detectable association between phenotype and genotype (11, 46, 47). In addition, genetic influences are distributed throughout multiple intermediate pathways that lead to the dyslipidemia phenotype. A linear statistical model cannot capture the nonlinear processing of genetic effects through the pathways that connect genotype with phenotype. Such impenetrable features introduce uncertainty into the application of any strategy for modeling genetic predictors of phenotypes that have a complex multifactorial etiology.
Lipid phenotype-APOE genotype statistical models
The statistically significant associations between the three measures of dyslipidemia and the three genotypes in the test samples of Danish females and males are consistent with the hypothesis that variation in the 5' promoter region of APOE has pleiotropic effects on lipid metabolism. This finding is also consistent with an earlier observation reported in young and middle-aged Danish females by Frikke-Schmidt et al. (32) that combining SNP variations in the 5' promoter region and in the exon 4 structural region doubled the estimated proportions of HDL-C variation that could be statistically explained compared with the proportion explained by the six exon 4 genotypes considered separately. The biological reality that structural variation in exon 4 of APOE has an important role in lipid and lipoprotein metabolism (16–18) raises the question of whether the observed abilities of the 5' genotypes to distinguish between high and low HDL-C, TG, or T-C are attributable to the effects of variation in the 5' promoter region or to an association attributable to linkage disequilibrium (LD) with the structural variation in exon 4. Frikke-Schmidt et al. (32) reported statistically significant pair-wise LD between SNPs in the 5' promoter region and in the exon 4 structural region in the Danish sample. However, the magnitudes of the relevant LD estimates were low. The r² measure of LD ranged from 0.023 to 0.079 for the 560-4075 and 832-4075 pairs of SNPs, respectively. It is unlikely that such weak pair-wise LD between these two regions could be responsible for the association of measures of lipid metabolism with particular 5' genotypes observed here. The statistical independence of the 5' genetic effects is consistent with our observations that low HDL-C, or high TG, was significantly associated with particular 5' genotypes in three of the four analyses that included the exon 4 variation.
The small role of LD is further supported by a statistically significant interaction between the effects of 5' genotypic variation and the exon 4 structural variation in predicting high T-C in females.
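For readers unfamiliar with the r² measure of LD used in the argument above, the standard definition for two biallelic sites is the squared disequilibrium coefficient D = p_AB - p_A p_B divided by p_A(1 - p_A)p_B(1 - p_B). The sketch below computes it from one haplotype frequency and two allele frequencies. The function and parameter names are hypothetical, and the source does not state how its r² values were computed.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """Squared allelic correlation r^2 between two biallelic sites.

    p_ab: frequency of the haplotype carrying allele A at site 1 and B at site 2
    p_a:  frequency of allele A at site 1
    p_b:  frequency of allele B at site 2
    """
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either site is monomorphic")
    return d * d / denom
```

A value of 0 means that the two sites are statistically independent and a value of 1 means that one site perfectly predicts the other, so estimates in the 0.02 to 0.08 range indicate that very little of the variation at one site is captured by the other.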
Biological inferences from phenotype-genotype statistical models
Artiga et al. (51) have found that variations in the 560 and 832 positions are associated with a significant heterogeneity in promoter activity in cell cultures. The A560T832 haplotype is associated with
Applicability to clinics and public health
Conclusions
The authors thank Kenneth G. Weiss for his persistent attention to the details of the data management and statistical analyses. The technical support of Lynn Illeck in developing this article is also deeply appreciated. This work was supported by National Institutes of Health Grants HL-072905, HL-072810, GM-066509, HL-054481, HL-051021, HL-039107, HL-058238, HL-058239, and HL-058240. Manuscript received May 17, 2005 and in revised form November 8, 2005.