Characterization of metabolic interrelationships and in silico phenotyping of lipoprotein particles using self-organizing maps


This new SOM approach was applied here to analyze and interpret the individual multivariate lipoprotein lipid data.

Logic
A characteristic feature of the SOMs is their ability to map nonlinear relations in multidimensional data sets into visually more approachable, typically two-dimensional planes of nodes. The overall concept of SOM analysis is illustrated in Fig. 1. The input data to the SOM from each case $i$, i.e., from each plasma sample in this particular application, contain a number of variables used to form a vector $\mathbf{d}_i = (d_1^{i}, d_2^{i}, \ldots, d_{N-1}^{i}, d_N^{i})$. The SOM algorithm (21,26) then transforms the input data vectors into a two-dimensional map in which each node $j,k$ ($j$ goes over the rows and $k$ over the columns, total of $J$ rows and $K$ columns) will be represented by a single feature vector $\mathbf{x}_{j,k} = (x_1^{j,k}, x_2^{j,k}, \ldots, x_{N-1}^{j,k}, x_N^{j,k})$ representing the original $N$-dimensional space, i.e., the input data. After the self-organizing process, the point density of the feature vectors follows roughly the probability density of the data, thereby making SOM a valuable tool for detecting similarities and groupings in a data set. The training algorithm is rather simple (and also robust to missing values), and it is easy to visualize the resulting maps. The feature vectors of the neighboring nodes in the two-dimensional map are similar to each other and thereby, importantly, the individuals ending up in nodes close by are similar also in the original $N$-dimensional space (21,24,27).
The visualization phase of the SOM analysis is twofold: first, to look at potential constellations of nodes (feature vectors) that would describe similar individuals (groups) in the original variable space; second, to depict input (or other related) variables over the two-dimensional map in order to obtain a quick overview of their distribution and values in different nodes, i.e., in the case of each feature vector. In other words, each node describes a model individual, which, in turn, bears a link to the individuals specified in the original N-dimensional space. The SOM algorithm thus offers the possibility to generate a form of average representations of model individuals along with identifying both metabolic and compositional characteristics and interrelationships out of multidimensional and complex lipoprotein data. Comparing the component planes of two or more variables in the two-dimensional map may provide insights into the dependencies between the variables and their potential similarities or dissimilarities for the various groups of model individuals. The use of color coding in the component planes is particularly helpful because clearly colored areas as well as correlated changes in the colors of different variables are visually easy to detect. Although it is difficult to define groups exactly in the organized map, subtle changes in colors are also good at indicating potentially diffuse borderline areas between various clusters (27).
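The self-organizing process described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the in-house MATLAB scripts or the Melikerion package: it uses a rectangular rather than hexagonal grid, linearly decaying learning rate and neighborhood width, and it does not handle missing values. The function names `train_som` and `best_matching_units` and all parameter defaults are hypothetical.

```python
import numpy as np

def train_som(data, rows=5, cols=7, n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small SOM; returns feature vectors with shape (rows, cols, N)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = data.shape
    # Initialise every feature vector x_{j,k} from a randomly chosen input vector.
    init = rng.integers(0, n_samples, size=rows * cols)
    weights = data[init].astype(float).reshape(rows, cols, n_features)
    # Grid coordinates (j, k) of all nodes, needed by the neighbourhood function.
    jj, kk = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # assumed linear decay
        sigma = sigma0 * (1.0 - frac) + 0.5      # assumed shrinking neighbourhood
        d = data[rng.integers(0, n_samples)]     # one input vector d_i
        # Best-matching node: smallest Euclidean distance to d_i.
        dist2 = np.sum((weights - d) ** 2, axis=2)
        bj, bk = np.unravel_index(np.argmin(dist2), dist2.shape)
        # Gaussian neighbourhood around the best-matching node on the grid.
        grid_d2 = (jj - bj) ** 2 + (kk - bk) ** 2
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        # Pull the winner and its neighbours towards the input vector.
        weights += lr * h[:, :, None] * (d - weights)
    return weights

def best_matching_units(data, weights):
    """Return the (row, col) node index of the closest feature vector per sample."""
    rows, cols, _ = weights.shape
    flat = weights.reshape(rows * cols, -1)
    dist2 = ((data[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2)
    idx = dist2.argmin(axis=1)
    return np.column_stack(np.unravel_index(idx, (rows, cols)))
```

Calling `train_som` on a samples-by-variables array and then `best_matching_units` gives the node at which each individual resides; averaging any variable over the residents of each node then yields a component plane of the kind discussed above.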

Lipoprotein data
The lipoprotein lipid data represent complex metabolic conditions. The SOM analysis of these data revealed groupings of input data parameters that characterize and define tissues is regulated by HDL that is often divided into two subpopulations, namely larger HDL 2 and smaller HDL 3 particles (7). Furthermore, the HDL metabolism and reverse cholesterol transport are related to the VLDL-IDL-LDL cascade via lipid transfer proteins (8,9).
Substantial amounts of data exist on alterations in circulating lipoprotein concentrations, for example, in atherosclerosis and in metabolic disorders like diabetes and the metabolic syndrome, which are familiar backgrounds for various vascular complications (10-13). The structural integrity and pertinent molecular composition of lipoprotein particles are the basis for the proper functioning of lipoprotein metabolism (14-16). However, the study of the molecular composition as well as the metabolic and structural interrelationships between lipoprotein particles is often hampered by the limited measurement data available and the molecular complexity of lipoprotein metabolism. Physical isolation of lipoproteins by sequential ultracentrifugation (UCF) is commonly used to measure plasma lipoprotein concentrations (17), but most studies are restricted regarding the analyzed subpopulations, and detailed attention is rarely paid to the molecular composition of the isolated particles.
Here, we focus on an extensive set of UCF-based lipoprotein data in which the apoB particles were isolated as VLDL, IDL, and LDL fractions and the HDL particles separated into HDL 2 and HDL 3 (18-20). The use of UCF is tedious and expensive; thus, most often only the total HDL fraction is physically isolated and the IDL fraction might not be separately spun. However, the current data set provides an experimental extreme available for clinically oriented lipoprotein studies and was therefore considered optimal for assessing the capabilities of the self-organizing map (SOM) analysis to visualize and interpret lipoprotein metabolism. In fact, the SOM analysis enabled a holistic combination of plasma lipoprotein concentrations and the corresponding compositional features of the particles. Several well-known metabolic issues arose from the SOM analysis of these data per se. As a novel derivative, the analysis resulted in purely data-driven in silico lipoprotein phenotyping allowing detailed characterization of various compositional and metabolic details beyond the experimentally available classifications.

Background
The SOM is an unsupervised pattern recognition technique (21) that organizes the input data according to given similarity criteria. The end result is a two-dimensional map, where mutually similar input data profiles are placed next to each other and on which all the measures can be easily visualized and compared. The SOM analysis is currently one of the most popular neural network methods already recognized as an effective and advantageous tool to handle complex data in various areas [see, for example, refs (22-26)]. We have also recently implemented and developed the SOM analysis into a metabonomics framework with incorporated p-value statistics (26-28).
males; 47% males). For some individuals, more than one sample was included from separate blood collections with a typical time interval of 6 months. The study population consisted of heavy alcohol drinkers (40%) (19), hysterectomised postmenopausal women on estrogen replacement therapy (41%) (20), and apparently healthy control individuals (19%) (19), thereby representing a wide range of plasma lipoprotein lipid values. The phenotypic characteristics of the study population are discussed in supplementary information I.

Ethics statement
The study protocol was in accordance with the Declaration of Helsinki and approved by the Ethical Committee of the Northern Ostrobothnia Hospital District, Oulu, Finland, and written informed consent was obtained from all subjects.

Isolation and composition of lipoprotein fractions
The blood samples were drawn after an overnight fast of 12 h into EDTA-containing tubes. Plasma was separated by centrifuga-
model individuals according to both plasma lipid concentrations and lipoprotein particle compositions with a link to metabolic pathways. The component planes clarified the clustering of the data set, i.e., the grouping of model individuals into biochemically interpretable areas, for example, in which the concentration of VLDL triglycerides (TGs) is high but that of IDL-TGs is low. It is important to realize that the formation of the models is based solely on the experimental data and the self-organizing process of the SOM algorithm as illustrated in Fig. 1.

Subjects
Biochemical lipoprotein lipid analyses were available from 233 individuals including 302 distinct lipid measurements (53% fe-
valid. Consequently, q (instead of p) was used to denote the level of regional variability on the map (28). All the analyses were performed using in-house scripts in the MATLAB programming environment. An open source package, termed Melikerion (26,27), for SOM analyses in the MATLAB/Octave programming environment is freely available. After constituting the SOM, the main regions with differing metabolic features were chosen by visual examination for further analysis. Some individuals residing in the borderline areas were excluded to obtain clearer lipoprotein phenotypes. All the analyses were performed on a laptop PC with an Intel Core2 Duo, 2.0 GHz processor, which trained a typical SOM and calculated the colorings in a few minutes.
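The rank-based scaling of the inputs and the permutation-based estimation of regional variability on the map can be illustrated with a short Python sketch. It is not the published procedure: the statistic used here (variance of the per-unit means) is an assumption chosen for simplicity, ties in the rank transformation are broken arbitrarily, and the names `rank_scale`, `regional_variability`, and `permutation_q` are hypothetical.

```python
import numpy as np

def rank_scale(values):
    """Map values to the interval [-1, 1] according to their ranks."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values))      # ranks 0 .. n-1
    return 2.0 * ranks / (len(values) - 1) - 1.0

def regional_variability(values, unit_index, n_units):
    """Variance of the per-unit mean values over non-empty map units.

    Assumed illustrative statistic; the published method may differ.
    """
    sums = np.bincount(unit_index, weights=values, minlength=n_units)
    counts = np.bincount(unit_index, minlength=n_units)
    occupied = counts > 0
    return (sums[occupied] / counts[occupied]).var()

def permutation_q(values, unit_index, n_units, n_perm=1000, seed=0):
    """Estimate how often shuffled data show at least the observed variability."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    observed = regional_variability(values, unit_index, n_units)
    # Null distribution: shuffle the values over the fixed sample positions.
    null = np.array([
        regional_variability(rng.permutation(values), unit_index, n_units)
        for _ in range(n_perm)
    ])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```

Here `unit_index` is the flattened map unit of each sample (for a 5 × 7 map, an integer from 0 to 34). A small returned value indicates that the regional pattern of a variable is unlikely to arise from random placement of the samples.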

General aspects of the self-organizing map
Conventionally, correlation analysis is applied to study linear associations between two lipoprotein measures. We have illustrated this for the current data set in supplementary information II, supplementary Fig. II. The correlation patterns imply that at an individual level the lipoprotein particle structure is rather consistent, but within each main lipoprotein fraction substantial compositional variation takes place (a detailed discussion is available in supplementary information II). However, multiparametric and nonlinear relationships, inherent in lipoprotein metabolism, remain intrinsically undetected in simple correlation analyses between two variables. Hence, we will demonstrate how the SOM analysis can be used to extend beyond linear assumptions in a multiparametric manner.
The SOM analysis was performed using a combination of plasma lipoprotein lipid concentrations and compositional lipoprotein particle measures as inputs (for details, see Materials and Methods, SOM analysis). The SOM component planes for plasma concentration measures are illustrated in Fig. 2 and for compositional particle measures in  Fig. 2 ; for example, the samples with highest concentrations of VLDL lipids are clus-tion at 1200-1500 g for 10-15 min at 4°C. The lipoprotein fractions were isolated from plasma by sequential UCF using density ranges of <1.006 g/ml for VLDL, 1.006-1.019 g/ml for IDL, 1.019-1.063 g/ml for LDL, 1.063-1.125 g/ml for HDL 2 , and 1.125-1.210 g/ml for HDL 3 (18)(19)(20). The lipoprotein fractions were isolated from fresh plasma samples and the lipid and protein analyses were commenced immediately after isolation of each fraction. The concentrations of total cholesterol, cholesterol esters (CEs), TGs, phospholipids (PLs), and total protein in the isolated lipoprotein fractions were determined as described previously ( 18,19 ) and expressed as mmol/l in plasma for lipids and mg/dl for proteins ( Table 1 ).

SOM analysis
Four lipid concentration measures per lipoprotein fraction were used, namely TGs, PLs, free cholesterol (FC), and CEs, together with the corresponding compositional measures (marked with an asterisk) calculated by scaling the concentration measures with the total protein amount in each fraction (e.g., VLDL-PL* = VLDL-PL/VLDL-protein). These latter parameters are approximations for the number of lipid molecules (~mol/g) in each lipoprotein particle and thereby provide a measure of the molecular composition of the physically isolated lipoprotein fractions. The concentration and compositional lipid parameters were deliberately used together in the SOM analysis to enable direct association of the concentration and structural information. However, the compositional inputs (e.g., VLDL-PL*) are not intuitive variables to interpret and therefore we have presented mass percentages (marked with two asterisks, e.g., VLDL-PL**) in the SOM component planes to demonstrate the compositional variability. Particle sizes for the VLDL, IDL, LDL, HDL 2 , and HDL 3 fractions were estimated as previously described (16). Briefly, the number of lipid molecules in a particular lipoprotein particle was calculated on the basis of the experimental data. The known average volumes of the lipid and protein molecules were then used to calculate the average particle size.
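As a small worked illustration of the compositional measures, the following Python sketch computes a lipid-to-protein ratio (the single-asterisk type of measure) and mass percentages (the double-asterisk type). The function names and all numerical values are hypothetical; the unit conversion assumes lipids in mmol/l and protein in mg/dl as stated in the text, and the mass percentage function assumes the component amounts have already been converted to a common mass unit.

```python
def compositional_measure(lipid_mmol_per_l, protein_mg_per_dl):
    """Lipid-to-protein ratio, e.g. VLDL-PL* = VLDL-PL / VLDL-protein.

    Returns mmol of lipid per g of protein.
    """
    protein_g_per_l = protein_mg_per_dl * 0.01   # 1 mg/dl = 0.01 g/l
    return lipid_mmol_per_l / protein_g_per_l

def mass_percentages(masses):
    """Convert component masses (same unit for all) to mass percentages."""
    total = sum(masses.values())
    return {name: 100.0 * m / total for name, m in masses.items()}

# Hypothetical example values, for illustration only.
vldl_pl_star = compositional_measure(lipid_mmol_per_l=0.2, protein_mg_per_dl=10.0)
vldl_percent = mass_percentages(
    {"TG": 50.0, "PL": 20.0, "FC": 5.0, "CE": 15.0, "Protein": 10.0}
)
```

Scaling by protein rather than by total mass keeps the compositional inputs directly tied to the measured quantities, whereas the mass percentages are easier to read in the component planes.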
The input data were scaled between −1 and 1 by rank transformation to prevent unjustified domination by any of the variables and then normalized to smooth the distribution of individuals into the grid. We chose a 5 × 7 map of hexagonal units (resulting in 8.6 samples per unit) and a Gaussian neighborhood function. We also did several runs with different map sizes, leading to essentially similar results, as expected, because the SOM is known to be rather insensitive to the choice of its size and other parameters (23). After the positions of the individuals on the SOM were computed, the map was colored according to the biochemical variables within different parts of the SOM with overall permutation estimations for the p-values for the statistical significance of the patterns (27). The null distributions from the permutation analysis were also the basis of the color scale in each component plane so that variables can be compared visually while maintaining the statistical interpretation. When interpreting the input variables, the p-value estimation is no longer strictly
The values are given as mean ± SD. TG, triglycerides; C, cholesterol; FC, free cholesterol molecules; CE, cholesterol esters; PL, phospholipids; Protein, total proteins. The details for the lipoprotein isolation and lipid analysis are given in the Materials and Methods section.
nounced for the PLs, FC, CEs, cholesterol, and total lipids but somewhat different between the LDL-TG and HDL 2 -TG. This is most likely an indication of a nonstructural role of TG molecules in these lipoprotein particles.
Thus, the SOM analysis enables an unsupervised discovery of multiple (nonlinear) associations; i.e., in composite data, a nonexistent linear correlation does not necessarily mean that the two measures would have no association. As noted above for the LDL and HDL 2 lipid concentrations, various metabolic pathways may exist that have distinct but different associations that, in the linear analysis, mix in such a manner that no clear common correlation is found.
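The point that oppositely associated subgroups can cancel out in a pooled linear correlation can be demonstrated with synthetic data. The sketch below is purely illustrative and uses simulated numbers, not the lipoprotein data; the function name `mixed_subgroup_correlations` is hypothetical.

```python
import numpy as np

def mixed_subgroup_correlations(n=500, noise=0.1, seed=0):
    """Pearson correlations within two subgroups and in the pooled data."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    # Subgroup A: positive association between the two measures.
    y_a = x + noise * rng.normal(size=n)
    # Subgroup B: negative association over the same range of x.
    y_b = -x + noise * rng.normal(size=n)
    r_a = np.corrcoef(x, y_a)[0, 1]
    r_b = np.corrcoef(x, y_b)[0, 1]
    # Pooling both subgroups hides the two opposite associations.
    r_pooled = np.corrcoef(np.concatenate([x, x]),
                           np.concatenate([y_a, y_b]))[0, 1]
    return r_a, r_b, r_pooled

r_a, r_b, r_pooled = mixed_subgroup_correlations()
print(f"subgroup A: {r_a:.2f}, subgroup B: {r_b:.2f}, pooled: {r_pooled:.2f}")
```

Within each subgroup the correlation is strong, while the pooled coefficient is close to zero, which is the situation a map-based grouping of individuals can help to disentangle.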

High concentration of plasma HDL 2 is associated with two different lipoprotein phenotypes
The high plasma concentrations of HDL 2 lipids (the southern region of the SOM in Fig. 2 ) consistently associate with the low concentrations of plasma VLDL lipids ( 29 ) but differentiate into two subgroups with respect to LDL lipid concentrations. In the southeast region of the SOM, the high HDL 2 lipid concentrations relate to relatively high concentrations of plasma LDL lipids and large IDL, LDL, HDL 2 , and HDL 3 particles together with small VLDL ( Figs. tered in the northwest corner and those with high LDL lipid concentrations in the northeast and eastern areas of the map. The negative associations between VLDL and HDL 2 lipid concentrations are also clearly seen in Fig. 2 via the opposite colorings for the VLDL and HDL 2 component planes for PLs, FC, CEs, cholesterol, and total lipids.

Toward complex associations: LDL and HDL 2
In addition to the linear relationships (supplementary Fig. II), the SOM component planes shown in Figs. 2 and 3 reveal further associations between the lipoprotein measures. For example, the SOM component planes in Fig. 2 explain why the plasma concentrations of LDL lipids barely correlate with those of HDL 2 (supplementary Fig. IIA). The associations between LDL and HDL 2 lipid concentrations are complex in such a way that different metabolic models arose from the data in the SOM analysis. Each of these models corresponds to various combinations of lipoprotein concentrations; those reflecting positive associations between LDL and HDL 2 are separated in the southeast and northwest areas of the SOM and those representing negative associations in the north-northeast and south-southwest areas of the SOM. These associations are pro-
Fig. 2. Statistical colorings of the lipoprotein lipid concentration measures (in mmol/l) in the SOM analysis of the combined concentration and compositional lipoprotein variables. The coloring is according to the characteristics of the local residents within each hexagonal unit. The concentration levels are color-coded to visualize whether the concentration level is above (reddish), at (white), or below (bluish) the median of the variable. The numbers on selected units give the local mean value for that particular region. The q-values below the plots indicate the probability of observing equivalent regional variability for random data (see Materials and Methods). Importantly, the very same SOM analysis is the basis for all the component planes shown (this holds also for Fig. 3) and thereby each of them can be directly compared; i.e., the distribution of the individuals is the same under every component plane. The abbreviations are as given in the caption for Fig. 1.
high LDL-C concentration in plasma could be associated with both small and large LDL particles.
Generally, the amounts of LDL particles seem not to be able to account for the LDL-C concentration in plasma, because with the similar LDL-C concentrations, the LDL-particle numbers can differ ( 34,35 ).
Our current findings via the SOM-based lipoprotein phenotyping indicate that the large LDL particles are associated with relatively high LDL-C (the southeast region of the SOM in Figs. 2 and 3) and the small dense LDL particles are mostly related to low plasma LDL-C concentrations (the western half of the SOM in Figs. 2 and 3). However, this association is not inclusive and there is a metabolic pathway in which the small dense LDL particles are related to high plasma LDL-C (the northeast region of the SOM in Figs. 2 and 3). The small LDL particles related to the high plasma LDL-C concentration seem to be rather TG-poor whereas the small LDL associated with low plasma LDL-C is enriched in TGs. Thus, the apparent contradictions noted above are most likely only reflections of different characteristics in the study populations. In fact, these associations are a good example of how simple (linear) correlation analysis is not able to reveal differently associated subgroups (nonlinearities) in the data. It is also notable that high concentration of plasma LDL-C associated
2, 3). The LDL, HDL 2 , and HDL 3 particles are enriched in PLs and FC, whereas the VLDL and IDL particles are relatively PL-poor. In contrast, in the southwest region of the SOM, the high plasma HDL 2 lipid concentrations are associated with low concentrations of plasma lipids in all apoB-containing lipoprotein particles, i.e., VLDL, IDL, and LDL. All these apoB-particles are also relatively small. In general, it is evident from Fig. 3 that there are clear associations between the lipoprotein particle size and the PL as well as the protein content of the particles, higher amounts of PL and lower amounts of protein indicating larger particles.

Plasma LDL-C concentration and the structural subtypes of LDL particles
Quite contradictory results have been published regarding the association of plasma LDL-C concentrations with the composition and characteristics of LDL particles. It has been reported that small dense and large LDL particle distributions do not differ in plasma LDL-C concentration (30,31). On the other hand, large LDL particles have been connected to higher LDL-C concentrations than the predominance of small LDL particles (32). The binding of LDL particles to the LDL receptor has been shown to be reduced with dense as well as large LDL compared with LDL with intermediate particle sizes (33), indicating that
For each lipoprotein fraction, the values are represented as mass percentages (**) for the lipids and protein or nm for the particle sizes. All other details are as given in the caption for Fig. 2. The abbreviations are as given in the caption for Fig. 1.
Fig. 4. A metabolic overview of the lipoprotein phenotypes arisen from the SOM analysis (illustrated in Figs. 2 and 3). The application of the SOM analysis to the combination of concentration and compositional lipoprotein data resulted in a novel perspective and also provided a subgrouping of the lipoprotein particles in each fraction, i.e., an in silico lipoprotein phenotyping beyond the experimental data. The term 'lipoprotein phenotype' is used here to denote a collection of lipoprotein subtypes for VLDL, IDL, LDL, HDL 2 , and HDL 3 related to a particular plasma lipoprotein concentration profile and forming a metabolically connected entity. Five different phenotypes were discovered (A-E as marked and color-coded on the SOM), all with characteristic plasma concentration profiles (indicated in the bottom) as well as distinct compositional features (summarized on the top). The scale in the concentration profiles indicates the total plasma lipid concentrations of the lipoprotein fractions in mmol/l.
The apoB-containing VLDL-IDL-LDL cascade is the principal route in the endogenous lipoprotein metabolism and relates primarily to the transport and hydrolysis of TG. Thus, the metabolic pathways of the lipoprotein phenotypes are organized here in two platforms, one for TG-enriched (on the left, with an orange background) and one for TG-poor particles (on the right, with a bluish background). The HDL particles are also accordingly divided into two groups with respect to their relative TG content. The solid color-coded arrows represent the metabolic pathways of the apoB-containing lipoprotein particles within each lipoprotein phenotype. The connections between apoB and HDL particles are indicated by the bidirectional dashed arrows. The sizes of apoB-containing particles as well as HDL particles are in scale although the sizes of the HDL particles are enlarged by a factor of 6. The relative contents of the various lipids in the lipoprotein particles are indicated by the upward and downward arrows. Structurally characteristic lipids in each particle are bolded. The abbreviations are as given in the caption for Fig. 1.
with high amount of small LDL particles is particularly risky with respect to cardiovascular disease (36).

Metabolic pathways of the lipoprotein phenotypes
The application of the SOM analysis enabled us to concomitantly assess various metabolic and compositional interrelationships between the experimentally isolated and characterized lipoprotein fractions. In fact, this appeared to be a novel viewpoint that also provided a detailed subgrouping of the lipoprotein particles within each fraction, i.e., a detailed in silico lipoprotein phenotyping beyond the experimental data. The term 'lipoprotein phenotype' is used here to denote a collection of lipoprotein subtypes for VLDL, IDL, LDL, HDL 2 , and HDL 3 related to a particular plasma lipoprotein concentration profile and forming a metabolically connected entity. An overview of the lipoprotein phenotypes arisen from the SOM analysis is given in Fig. 4. The apoB-containing lipoprotein cascade, i.e., VLDL-IDL-LDL, is a key route in the endogenous lipoprotein metabolism and relates principally to the transport and hydrolysis of TG (6). Thus, in Fig. 4, the metabolic pathways of the lipoprotein phenotypes are organized in two platforms, one for TG-enriched and one for TG-poor particles (reflecting the SOM analysis shown in Fig. 3). The HDL particles are also accordingly divided into two groups with respect to their relative TG content. Five different lipoprotein phenotypes were discovered, all with characteristic plasma lipoprotein lipid concentration profiles, as well as distinct compositional and metabolic features. Some key findings will be highlighted and discussed below.

A lipoprotein phenotype reflecting characteristics of the metabolic syndrome
Figure 4 depicts lipoprotein phenotype B in which all the lipoprotein particles are enriched in TG except IDL being fairly TG-poor (see also the northwest region of the SOM in Fig. 3). The VLDL particles in this phenotype are large and also enriched in PL and FC. However, the IDL and LDL particles are quite small. Interestingly, although VLDL particles are FC-enriched, the corresponding IDL, LDL, HDL 2 , and HDL 3 particles are FC-poor. The plasma concentration of VLDL total lipids is high but those of LDL and HDL 2 are low; the concentration of IDL total lipids is also somewhat elevated (see Fig. 4 and the northwest region of the SOM in Fig. 2). Consequently, the characteristics of lipoprotein phenotype B resemble those inherent for the metabolic syndrome (10).
These findings are also in line with studies showing that the delipidation of large VLDL can produce low levels of LDL (37) and that large VLDL is related to small dense LDL (32,38,39). In addition, the small dense LDL phenotype has been linked to increased production and decreased catabolism of VLDL particles (40). This is consistent with our findings here with respect to lipoprotein phenotype B in which the plasma concentration of VLDL is high and VLDL particles are large and TG-enriched, whereas the LDL lipid concentrations are low with the preponderance of small, TG-enriched LDL particles. It is also notable that the low percentage of FC in the IDL, LDL, and HDL particles of phenotype B may be a structural issue to enhance the oxidative susceptibility of these lipoproteins (41,42).

Plasma lipoprotein concentrations do not predict lipoprotein phenotypes
Very similar plasma concentrations of VLDL and IDL in phenotypes A and E relate to significantly different composition of the VLDL as well as IDL particles in these phenotypes; the VLDL and IDL particles in phenotype A are TG-enriched and CE-poor, the situation being the opposite in the case of phenotype E. Also, high plasma LDL concentration, which is characteristic for phenotypes C, D, and E, relates to remarkable variations in the composition of the LDL particles between the phenotypes and, notably, even more profound differences in the composition as well as the size of the VLDL and IDL particles.
Lipoprotein metabolism is a complex crosstalk of various lipoprotein particles as well as enzymes and lipid transfer proteins. For example, during lipolysis of apoB-containing particles the phospholipid transfer protein increases the particle distribution of HDL toward HDL 2 subclasses (9,43). On the other hand, the cholesteryl ester transfer protein mediates heteroexchange of CE and TG between HDL and apoB-containing lipoprotein particles, the transfer of TG being toward HDL and that of CE toward apoB-containing particles (8,16). Therefore, it is not unexpected that the plasma lipoprotein concentrations alone can only give a limited view on the overall lipoprotein metabolism.

Rationale for the in silico lipoprotein phenotyping
Even though detailed data on lipoprotein particles would currently be preferred in cardiovascular research, the subpopulation analysis is usually based on particle size (e.g., using gradient gel electrophoresis or nuclear magnetic resonance spectroscopy) and therefore, the chemical composition of the particles remains unknown (7,36). In analytical lipid biochemistry, sequential ultracentrifugation is the gold standard for physical lipoprotein isolation allowing for subsequent analyses of the molecular composition of the particles (18-20). However, the UCF-based lipoprotein work is most often restricted regarding the analyzed subfractions. This is because the finer the density ranges used for the isolation, the more tedious and expensive the analyses become (44). Consequently, it would generally be beneficial to computationally enhance the UCF-based lipoprotein data as illustrated in this work. In particular, deeper insight into the compositional variations in the lipoprotein particles appears to be a fundamental issue. Lipoprotein physiology and pathophysiology are about the transfer and exchange of various lipid molecules between the lipoprotein particles and tissues. Thus, not only the concentration but also the quality of lipoprotein particles and the form of transportation matter.