Research Article, Open Access
Association between Percentages of Scale Variables, as Related to Distributions
Faculty of Medicine, Institute of Health and Society, University of Oslo, Norway
*Corresponding Author: Høstmark AT, Institute of Health and Society, University of Oslo, Box 1130 Blindern, 0318 Oslo, Norway
Fax: +47 22850590
Tel: +47 22844629
E-mail: [email protected]
Received: March 01, 2019; Published: April 17, 2019
Citation: Høstmark AT (2019) Association between Percentages of Scale Variables, as Related to Distributions. J Nutr Diet Suppl 3(1): 104
We recently reported that the concentration distribution per se explains the positive correlation between percentages of arachidonic acid (AA, 20:4 n6) and eicosapentaenoic acid (EPA, 20:5 n3), fatty acids with antagonistic actions, and suggested that Distribution dependent correlations are a physiological regulatory mechanism. With many positive scale variables, A, B, C, ..., it is not apparent whether e.g. %A is positively or negatively associated with %B. The aim of the present work was to find some general rules that govern the association between percentages, using algebraic and geometric reasoning, and computer experiments to test hypotheses. Using random numbers, the analyses were first simplified to involve three positive variables. Thus, %A+%B+%C=100, where C may be considered as the sum of all variables except A and B; thus, %B=-%A+(100-%C), each variable with a particular range. To obtain two unknown variables only, the equation was simplified in two ways: 1) by making the expression (100-%C) approach zero, and 2) by making %C approach zero. Condition 1) may be achieved with %C-values close to 100, implying low %A- and %B-values. If so, the equation should approach %B=%A, giving a positive association between %A and %B; however, the slope will be determined by the ranges of A and B. Thus, a more precise general equation would be: %B_{(p-q)}=-%A_{(r-s)}+(100-%C_{(t-u)}), where the subscript parentheses indicate ranges of A, B, and C. Condition 2) would give %B=-%A+100, i.e. showing an inverse %A vs. %B relationship. We anticipated positive (negative) correlations also close to these conditions.
Computer experiments with random numbers, and varying distributions (ranges) of the variables, seemed to verify the theoretical reasoning above, and suggest these rules: 1) With 3 positive scale variables, two of which have a low-number distribution and low variability as compared with the third variable, we might expect a positive association between percentages of the low-number variables, and a negative association between the percentage of the high-number variable and the percentage of each of the low-number variables. 2) A decrease (increase) in the variability of either or both of the two low-number variables will improve (make poorer) the association between their relative amounts. In contrast, a narrowing (broadening) of the distribution of the high-number variable will make poorer (improve) the association between percentages of the low-number variables. 3) With 3 positive scale variables, two of which have a high-number distribution as compared with the third variable, we might expect a negative association between percentages of the two high-number variables, and probably no significant correlation between the percentage of the low-number variable and the percentage of each of the high-number variables. 4) Distributions of 3 scale variables can be manipulated so as to obtain a Turning Point, i.e. a situation where a positive (negative) association between percentages of two of them turns to become negative (positive). 5) With more than 3 variables, we may calculate the sum of all, minus the two under investigation; the preceding rules may apply with this “3-variable-modification”. 6) Distribution dependent correlations may be used as a regulatory mechanism in physiology.
Keywords: Correlation; Percentages; Uniform distribution; Normal distribution; Random numbers
List of Abbreviations: LA: Linoleic Acid (18:2 n6); ALA: Alpha Linolenic Acid (18:3 n3); AA: Arachidonic Acid (20:4 n6); EPA: Eicosapentaenoic Acid (20:5 n3); DPA: Docosapentaenoic Acid (22:5 n3); DHA: Docosahexaenoic Acid (22:6 n3)
Definitions: Range: difference between the largest and smallest values.
Distribution: graph showing the frequency distribution of a scale variable within a particular range. In this article, we may also write distribution when referring to a particular range, a-b.
Uniform distribution: every value within the range is equally likely. In this article, we may write “Distribution was from a to b”, or “Distributions of A, B, and C were a-b, c-d, and e-f, respectively”.
“Low–number variables” have low numbers as compared with “high-number variables”.
In nutrition studies we may raise the question of whether the relative amount of a diet-related blood/tissue variable is positively or negatively associated with the percentage of another variable. But how will percentages of the same sum, in general, be related? To our knowledge, this issue has not been much investigated. One reason for the apparent lack of interest could be a methodological concern arising when correlating percentages, since the same sum appears in the denominator when calculating corresponding percentages. Thus, an observed association between relative abundances could possibly be a spurious correlation. It is, however, not a priori apparent whether a significant correlation between percentages will be positive or negative, and to what extent the association may be attributed to variability.
Arachidonic acid (AA) is formed in the body from linoleic acid (LA), a major constituent in many plant oils, and is converted by cyclooxygenase and lipoxygenase into various eicosanoids, i.e. prostacyclins, thromboxanes and leukotrienes [1-3]. AA-derived thromboxane A_{2} (TXA_{2}) and leukotriene B_{4} (LTB_{4}) have strong proinflammatory and prothrombotic properties. Furthermore, endocannabinoids, which are derived from arachidonic acid, may have a role in adiposity and inflammation [4]. It is well known that EPA and AA are metabolic antagonists [1-3]. Eicosanoids derived from EPA may decrease inflammatory diseases [5,6], development of cardiovascular diseases [7], and cancer [8]. When considering the beneficial health effects of foods rich in EPA, many of the positive effects would be anticipated if the fatty acid works to counteract effects of AA. It has been reported that a decreased level of the serum EPA/AA ratio may be a risk factor for cancer death [8]. It would appear, accordingly, that a coordinated regulation of the relative abundances of EPA and AA could be of physiological interest, so that an increase (decrease) in the percentage of one of these fatty acids would be accompanied by a concomitant increase (decrease) in the percentage of the other. Indeed, we recently reported this outcome in breast muscle lipids of chickens [9,10].
Furthermore, we observed that the concentration distribution per se of AA, EPA, DPA, and DHA seemed to be crucial for obtaining a positive association between their relative abundances [9,10]. This conclusion was largely based upon computer experiments with random numbers, sampled within the real, physiological concentration distributions of the fatty acids. The computer experiments demonstrated that even small changes in distributions might cause appreciable changes in scatterplots, and correlation coefficients. We hypothesized that evolution might have “chosen” particular concentration ranges to ensure that percentages of fatty acids with antagonistic actions became positively associated, in order to make their relative abundances balanced. Furthermore, if distribution ranges were essential, then changes in concentration distributions should disturb the relationship between e.g. %EPA and %AA, and so was indeed observed in our analyses [9,10]. Thus, according to this hypothesis, relative abundances of two scale variables could be positively associated as a consequence of their particular concentration distributions (ranges), i.e. there is a Distribution dependent regulation. Variability could be related to differences between subjects, but also depend on intra-individual variations, for example related to time, diet, and environment in general, implying that the Distribution dependent regulation might take place both between and within subjects.
Our previous observations raise the question of whether there are some general rules governing Distribution dependent correlations. In this regard, we previously provided some suggestions [9,10]. The aim of the present work was to 1) further develop a theoretical basis to explain such correlations, 2) carry out computer experiments to test hypotheses, and finally, 3) find some general rules that govern the association between relative amounts of scale variables. Some of the calculations are repeats (with different sets of data) of previous analyses [10]; inclusion here is to provide a more complete presentation of how Distribution dependent correlations are brought about.
In our previous works [9,10], we investigated the association between the relative amount of AA and percentages of the n3 fatty acids (EPA, DPA, DHA). From histograms, we found physiological concentration distributions (g/kg wet weight) for the fatty acids. Next we computed the sum (g/kg wet weight) of all fatty acids, and the remaining sum when omitting the couple of fatty acids under investigation. We then had 3 scale variables only. With these variables, and with surrogate random number variables generated within the true distributions, we did analyses as shown below. For the purpose of the present work, we name the 3 variables A, B, and C. Our previous analyses suggested that the question of whether e.g. %A and %B were significantly correlated or not, depended upon the particular distribution (range) of each of the variables, as shown by comparing outcomes based upon real values (obtained in a diet trial) with the results found using surrogate, random numbers with varying distributions. In the present work, we solely use random numbers to explore conditions governing the association between percentages of A and B. Dependency between percentages is shown by the equation %A+%B+%C=100. Using random numbers for the three variables, we made scatterplots for the %A vs. %B association, and did correlation analyses. We used both Pearson correlations (r) and non-parametric, Spearman’s correlation (rho); the outcomes did not differ, but the sizes of corresponding correlation coefficients varied slightly. Finally, we studied how alterations in the distributions (ranges) of A, B, and C might change the relationship between percentages of A, B, and C. For each analysis, we made several repeats with new sets of random numbers; the general outcome of the repeats was always the same, but corresponding correlation coefficients (scatterplots) varied slightly. Results are presented as histograms, and as scatterplots with correlation coefficients indicated in the figure text. 
We also show equations of the regression lines. We mainly use random numbers with a uniform distribution, but also include some experiments with normal distribution of the random numbers. To find appropriate mean values, and corresponding standard deviations necessary to generate the random numbers, we were guided by previous measurements of fatty acids [10]. The outcomes were qualitatively the same when using random numbers with uniform and normal distributions. SPSS 25.0 was used for the analyses, and for making figures. The significance level was set at p<0.05. In the text below, I present in more detail theoretical considerations, and computer analyses. In addition to an algebraic approach, I also present a geometric reasoning to explain correlations between percentages (see below).
We first involve 3 positive scale variables only, A, B, and C (among a large number, A, B, C….), each with a particular distribution (range, variability). C may, for example, be considered to represent the sum of the remaining fatty acids when omitting type A and B. Below is an attempt to present some theoretical considerations, and computer analyses.
On the other hand, if %C consists of very small values, we should expect a negative %A vs. %B association, since the equation in this case would approach %B=-%A+100. Additionally, we should anticipate positive or negative correlations between percentages of A and B also within a certain boundary around the above-mentioned conditions, but with poorer outcomes the less closely these conditions are met. Below I will exemplify some of these situations, including results of computer experiments.
One way to approximate this requirement is to let A and B both have a distribution with low numbers as compared with C, for example for A(B) from 0.10 to 0.15, and C a distribution involving high numbers, e.g. from 1.0 to 10.0. A computer-analysis verified that - with these distributions - the requirements above were valid; i.e. the major part of %C-distribution was high (95 to 97%; Figure 1, right panel); Median=95.7%; CV=(3.95/94.16)*100=4.2%. That the requirement (100-%C)>%A is valid, in accordance with the reasoning above, was verified by computer analysis (not shown). The association between percentages of A and B in the current example is shown in Figure 1, left panel; r=0.964, p<0.001, n=200. The equation of the regression line is:
%B=0.99 (0.02)*%A+0.07 (0.07), i.e. verifying that %B was close to %A.
Figure 1: Association between percentages of two low-number variables, A and B, both with distribution 0.10 - 0.15, and one high-number variable C with distribution 1.0-10.0; r=0.941 (p<0.001, n=200); regression line: %B=0.99 (0.02)*%A+0.07 (0.07)
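The experiment behind Figure 1 can be approximated with a few lines of Python. This is a minimal sketch and not the SPSS procedure used in the article: the seed, the helper names `pearson` and `percentages`, and the use of the standard library random generator are assumptions made for illustration, so the coefficient will differ slightly from the reported one.

```python
import random
from statistics import fmean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def percentages(ranges, n, rng):
    """Draw n uniform triples (A, B, C); return lists of %A, %B, %C."""
    pa, pb, pc = [], [], []
    for _ in range(n):
        a, b, c = (rng.uniform(lo, hi) for lo, hi in ranges)
        total = a + b + c
        pa.append(100 * a / total)
        pb.append(100 * b / total)
        pc.append(100 * c / total)
    return pa, pb, pc

rng = random.Random(1)  # arbitrary seed (assumption), only for repeatability
# Two low-number variables (A, B) and one high-number variable (C).
pa, pb, pc = percentages([(0.10, 0.15), (0.10, 0.15), (1.0, 10.0)], 200, rng)
r_ab = pearson(pa, pb)
print(f"r(%A, %B) = {r_ab:.3f}; lowest %C = {min(pc):.1f}")
```

Because C is never below 1.0 while A+B never exceeds 0.30, every %C value stays above roughly 77%, which is the "high %C" condition the text relies on.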
This example seems to explain all of the positive correlations that we previously reported concerning %AA vs. percentages of n3 fatty acids [9,10]. Thus, the answer to the question above appears to be “yes”.
Slope of the regression line: Above we showed that there should be a positive association between %B and %A, if %C values were very high so that the expression (100-%C) approached zero. However, in this case it is inappropriate to write %B=%A, like Y=X. In the latter case, both the abscissa and the ordinate may have any value on the scale, and the Y vs. X graph would have slope=1. In contrast to this, %B- and %A-values are limited by the B and A distributions (ranges), respectively. A more general equation would be: %B_{(p-q)}=-%A_{(r-s)}+(100-%C_{(t-u)}), where the subscript parentheses indicate ranges of A, B, and C. The slope of the %B vs. %A regression line will accordingly be determined by the ranges of A (%A) and B (%B). Thus, if A and B both have the same distribution (range), then the slope should be close to 1, as observed above (Figure 1), with range 0.10-0.15 for both A and B, and 1-10 for C. With differing ranges for A and B, e.g. for A 0.20-0.40, for B 0.10-0.15, and for C 1-10, the equation of the regression line was: %B=0.38 (0.01)*%A+0.22 (0.10). We additionally did a manual calculation based upon minimum and maximum values of the A (%A) and B (%B) distributions. Using (max - min) of %B, divided by (max - min) of %A as a crude slope estimate, we found the slope values to be 1.09, and 0.38, respectively, in these two examples.
We rewrite the equation above: %A=-%C+(100-%B). Since we have given B (and accordingly %B) low values, we could roughly approximate the equation to %A=-%C+100, which should give an inverse linear association between %A and %C, and a regression line that crosses both axes at approximately 100%. The finding that extrapolation of the regression line did cross the %C-axis at 100 (figure not shown) is confirmed by the equation of the regression line: %A=-0.50 (0.01)*%C+49.9 (0.5). However, as expected, the assumption that %B approached zero was incorrect; extrapolation shows that the regression line crosses the %A-axis at 50%. Similarly, we may rewrite to %B=-%C+(100-%A), which approaches: %B=-%C+100. Hence, for %C vs. %A (B) we obtained: r=-0.905 (-0.965); p<0.001 for both.
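The two slope estimates described above (least-squares slope and the crude range ratio) can be compared in a short sketch. The helper names `ols`, `pct_ab` and `crude_slope` and the seed are illustrative assumptions; only the ranges are taken from the text.

```python
import random
from statistics import fmean

def ols(x, y):
    """Least-squares slope and intercept of y regressed on x."""
    mx, my = fmean(x), fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def pct_ab(range_a, range_b, range_c, n, rng):
    """Return lists of %A and %B for n uniform draws of (A, B, C)."""
    pa, pb = [], []
    for _ in range(n):
        a = rng.uniform(*range_a)
        b = rng.uniform(*range_b)
        c = rng.uniform(*range_c)
        total = a + b + c
        pa.append(100 * a / total)
        pb.append(100 * b / total)
    return pa, pb

def crude_slope(pa, pb):
    """(max - min) of %B divided by (max - min) of %A."""
    return (max(pb) - min(pb)) / (max(pa) - min(pa))

rng = random.Random(2)  # arbitrary seed (assumption)
same = pct_ab((0.10, 0.15), (0.10, 0.15), (1.0, 10.0), 200, rng)
diff = pct_ab((0.20, 0.40), (0.10, 0.15), (1.0, 10.0), 200, rng)
slope_same, _ = ols(*same)
slope_diff, _ = ols(*diff)
print(f"equal ranges:   OLS {slope_same:.2f}, crude {crude_slope(*same):.2f}")
print(f"unequal ranges: OLS {slope_diff:.2f}, crude {crude_slope(*diff):.2f}")
```

With equal ranges the fitted slope should come out near 1, and with the wider, higher A range it should drop well below 1, in line with the values quoted in the text.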
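One way to see why the fitted line has slope near -0.5 and intercept near 50 is that %A+%B=100-%C holds exactly, and with identical ranges for A and B each of them takes about half of that remainder. The sketch below checks this numerically; the seed and helper name `ols` are assumptions for illustration.

```python
import random
from statistics import fmean

def ols(x, y):
    """Least-squares slope and intercept of y regressed on x."""
    mx, my = fmean(x), fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

rng = random.Random(3)  # arbitrary seed (assumption)
pa, pc = [], []
for _ in range(200):
    a = rng.uniform(0.10, 0.15)
    b = rng.uniform(0.10, 0.15)
    c = rng.uniform(1.0, 10.0)
    total = a + b + c
    pa.append(100 * a / total)
    pc.append(100 * c / total)

slope, intercept = ols(pc, pa)   # %A regressed on %C
x_cross = -intercept / slope     # where the line meets the %C-axis
print(f"%A = {slope:.2f} * %C + {intercept:.1f}; crosses %C-axis at {x_cross:.1f}")
```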
We consider again 3 scale variables, A, B and C, giving %A+%B+%C=100, i.e. %B=-%A+(100-%C). Thus, if %C approaches zero, then the equation would approach %B=-%A+100, and there should be close to a perfect linear inverse relationship between percentages of A and B. In our previous example, we used two “low-number variables” (A, B), and one with higher values (C). To obtain very low %C-values, we now generate random numbers with a distribution representing two high-number variables (A and B), for example both having distribution 1.0 to 7.0, and one low-number variable (C), e.g. with distribution 0.25 to 0.40. The %A vs. %B association is shown in Figure 2; r=-0.995; p<0.001; n=200. Regression line: %B=-0.99(0.01)*%A+95.4(0.35).
The histogram showed a %C - distribution of about 2-10% (not presented); therefore, the line does not completely fit the equation %B=-%A+100; extrapolation shows that the line will cross both axes at 95.4% instead of at 100%.
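The inverse relationship can be reproduced in the same way. In this sketch (seed and helper names are assumptions) the intercept of the fitted line falls below 100 by roughly the typical %C value, which is the deviation from %B=-%A+100 described above.

```python
import random
from statistics import fmean

def ols_r(x, y):
    """Return (slope, intercept, Pearson r) for y regressed on x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / (sxx * syy) ** 0.5

rng = random.Random(4)  # arbitrary seed (assumption)
pa, pb, pc = [], [], []
for _ in range(200):
    a = rng.uniform(1.0, 7.0)      # high-number variable
    b = rng.uniform(1.0, 7.0)      # high-number variable
    c = rng.uniform(0.25, 0.40)    # low-number variable
    total = a + b + c
    pa.append(100 * a / total)
    pb.append(100 * b / total)
    pc.append(100 * c / total)

slope, intercept, r = ols_r(pa, pb)
print(f"r = {r:.3f}; %B = {slope:.2f} * %A + {intercept:.1f}; max %C = {max(pc):.1f}")
```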
Since the condition (100%-%C)>%A seems to give a positive association between %A and %B, and a situation where %C approaches zero gives a negative %A vs. %B relationship, we would expect that, somewhere in-between these extreme conditions, the positive (negative) %A vs. %B association must change to become negative (positive). We name this anticipated condition The Turning Point. We present below examples of how to obtain a turning point, i.e. a change from positive (negative) to negative (positive) %A vs. %B association.
Figure 2: An example of the relationship between percentages of two variables (out of 3), both with distribution with high numbers (1.0-7.0), and a third variable (C) with distribution among lower numbers (0.25-0.40); r=-0.995, p<0.001. Regression line: %B=-0.99 (0.01)*%A+95.4 (0.35)
Will we obtain a Turning Point from positive to negative %A vs. %B correlation by progressively decreasing %C-values?: We first did a computer experiment in which we kept the A (B) distributions 0.10-0.15, but progressively lowered C-values (range) from a starting point of 1-10. By this alteration we obtained progressively lower %C-values, and accordingly also a progressive increase in percentages of A and B. Additionally, the correlation coefficient for the positive association between %A and %B should be increasingly attenuated as the %C-values decrease, and so did happen (not shown). It turned out that, with A (B) 0.10-0.15, and C 0.4-0.8, the %A vs. %B correlation was still positive, r=0.444 (p<0.001); %A vs. %C, r=-0.846 (p<0.001); %B vs. %C, r=-0.854 (p<0.001). The main %C-distribution was about 60-78%, and %A (B) 10-22%. However, keeping the same range of A (B), but lowering the range of C to 0.4-0.6, the previous positive %A vs. %B correlation had become non-significant, r=0.081 (p=0.302); %A vs. %C, r=-0.724 (p<0.001); %B vs. %C, r=-0.747 (p<0.001). The main %C-distribution had only a minor decrease, to 60-72%, and %A (B) had a minor increase to 12-22%. Next, we made a further small change in C-distribution, to 0.2-0.3, keeping the A (B) distributions unchanged. Now, the %A vs. %B association had turned from positive to become negative: r=-0.261 (p=0.001); %A vs. %C, r=-0.611 (p<0.001); %B vs. %C, r=-0.605 (p<0.001). The %C-distribution had fallen to 42-57%, and %A (B) had increased to 20-32%. Thus, at this level an apparently small adjustment of the C-distribution had the effect that a positive association between percentages of A and B turned to become negative, i.e. the Turning Point had been passed. We emphasize that these calculations are just some examples of how a positive %A vs. %B association can change to become negative in response to changing distributions.
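The stepwise lowering of the C range can be sketched as a small sweep. Two departures from the article are assumptions of this example: the sample size is 20000 rather than 200, so that coefficients close to zero get a stable sign, and the seed and helper names are arbitrary.

```python
import random
from statistics import fmean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def r_pct_ab(range_ab, range_c, n, rng):
    """Correlation between %A and %B when A and B share one uniform range."""
    pa, pb = [], []
    for _ in range(n):
        a = rng.uniform(*range_ab)
        b = rng.uniform(*range_ab)
        c = rng.uniform(*range_c)
        total = a + b + c
        pa.append(100 * a / total)
        pb.append(100 * b / total)
    return pearson(pa, pb)

rng = random.Random(5)  # arbitrary seed (assumption)
c_ranges = [(1.0, 10.0), (0.4, 0.8), (0.4, 0.6), (0.2, 0.3)]
results = {cr: r_pct_ab((0.10, 0.15), cr, 20000, rng) for cr in c_ranges}
for (lo, hi), r in results.items():
    print(f"C {lo}-{hi}: r(%A, %B) = {r:+.3f}")
```

The printed coefficients should fall step by step and end below zero for the lowest C range, mirroring the passage through the Turning Point.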
Will we obtain a Turning Point from negative to positive %A vs. %B correlation by progressively increasing %C-values?: We next investigated how a negative %A vs. %B association would respond to increasing the values of %C, starting with the previous A and B distribution 1-7, and C- distribution 0.25-0.40. Computer analyses showed that, by progressively increasing C- values, the negative %A vs. %B association became progressively poorer (as judged by the correlation coefficients, and scatterplots, not shown), but the negative association still prevailed when increasing the C-distribution to 1 to 10 (r=- 0.341, p<0.001). However, with C going from 1 to 17 there was no longer a significant correlation (r=-0.024, p=0.764), and with C going from 1 to 25, we obtained a positive %A vs. %B correlation (r=0.293, p<0.001); i.e. the turning-point had been passed. In this latter case we had obtained a situation with two small-number variables (A, B), and one high-number variable (C), i.e. the condition mentioned under 1) above.
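Rather than trying C ranges by hand, the Turning Point can be located numerically. The sketch below is an illustration under stated assumptions, not the procedure used in the article: it fixes the lower bound of C at 1, reuses one set of unit random numbers for every candidate upper bound (so that r changes smoothly), and bisects between the upper bounds 10 and 25 reported above. The names `r_ab` and `turning_point`, the seed and the sample size are hypothetical.

```python
import random
from statistics import fmean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

N = 5000                     # larger than the article's 200 (assumption)
rng = random.Random(6)       # arbitrary seed (assumption)
# Common random numbers: draw unit uniforms once, rescale per candidate range.
U = [(rng.random(), rng.random(), rng.random()) for _ in range(N)]

def r_ab(c_hi, c_lo=1.0, ab=(1.0, 7.0)):
    """r(%A, %B) with A, B uniform on ab and C uniform on c_lo..c_hi."""
    lo, hi = ab
    pa, pb = [], []
    for ua, ub, uc in U:
        a = lo + (hi - lo) * ua
        b = lo + (hi - lo) * ub
        c = c_lo + (c_hi - c_lo) * uc
        total = a + b + c
        pa.append(100 * a / total)
        pb.append(100 * b / total)
    return pearson(pa, pb)

low, high = 10.0, 25.0       # r is negative at 10 and positive at 25
for _ in range(20):          # plain bisection on the sign of r
    mid = (low + high) / 2
    if r_ab(mid) < 0:
        low = mid
    else:
        high = mid
turning_point = (low + high) / 2
print(f"r changes sign near C upper bound = {turning_point:.1f}")
```

The location found this way depends on the sample drawn, so it should be read as an estimate for one set of random numbers rather than a fixed constant.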
This computer exercise shows that our presented algebraic consideration to evaluate the %A vs. %B outcome seems to work well when %C-values are very low, and may also be applied when %C values are very high. Furthermore, this type of reasoning shows that there must be a turning-point concerning positive and negative correlations, but it may be hard to find turning points without doing computer analyses.
With reference to the reasoning above and the equation %B=-%A+(100-%C), a progressive decrease (increase) in %C should tend to change the %A vs. %B association from being positive (negative) to become non-significant, and then - at a particular turning point - change to be negative (positive). This statement raises the question of how A (B, C) - narrowing from both sides, within a certain range, might influence the relationship between percentages of A and B. Furthermore, how will a broadening of a particular A (B, C) - range, to both sides, influence the %A vs. %B correlation?
Narrowing the distribution of A and/or B from both sides simultaneously: As shown above, a positive association between %A and %B is favored with low-number distributions of A and B, and a high-number distribution of C. This outcome relates primarily to the high values of %C encountered under this condition, raising the question of how a narrowing of A and/or B within a certain range would influence %C values. Narrowing a range is to decrease the variability. Accordingly, the number of both high and low values within the original distribution range of A (B), is expected to decrease, as is also the number of high and low percentages of A (B).
Decreased number of high %A (B) -values should favor more of high %C -values, whereas increased number of high %A (B) -values should favor less of high %C. This consideration raises the question of whether any of these %C -changes will prevail when narrowing A (B).
We accordingly did a computer experiment, starting with distribution for A 0.1-0.3, for B 0.2-0.4, and for C 1-10. Histograms of the distribution of percentages of A, B, and C are shown in Figure 3. The distribution of %A and %B are positively skewed, and that of %C is negatively skewed (Figure 3). The reason for the skewness is that we have a combination of low-number and high -number variables, implying that the distribution of percentages of the former will be positively skewed, whereas that of the latter one will be negatively skewed. To explain this outcome, we may consider two variables only; A with low-number distribution, e.g. close to 1, and B with a high -number distribution, e.g. 5 to 100. Since %A in this case is 100*A/ (A+B), the expression will approach %A=100/(1+B). Thus, for each unit increase in B, the denominator (and % A) changes more in the lower range of B than in the upper B -range. This means a high number of cases within each % A -unit increase in the lower end of the %A scale, as compared with the number of cases within each %A -unit increase in the upper end of the %A -scale. We would accordingly expect a %A (B) -histogram with a tail towards higher values. Consequently, the %C -histogram should have a tail towards lower values.
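The direction of the skewness can be checked numerically. This is a minimal sketch: the moment-based `skewness` helper, the seed, and the sample size of 2000 (chosen for stability, larger than the article's 200) are assumptions for illustration.

```python
import random
from statistics import fmean, quantiles

def skewness(x):
    """Moment coefficient of skewness: positive means a tail to the right."""
    m = fmean(x)
    m2 = fmean([(v - m) ** 2 for v in x])
    m3 = fmean([(v - m) ** 3 for v in x])
    return m3 / m2 ** 1.5

rng = random.Random(7)  # arbitrary seed (assumption)
pa, pb, pc = [], [], []
for _ in range(2000):
    a = rng.uniform(0.1, 0.3)
    b = rng.uniform(0.2, 0.4)
    c = rng.uniform(1.0, 10.0)
    total = a + b + c
    pa.append(100 * a / total)
    pb.append(100 * b / total)
    pc.append(100 * c / total)

for name, values in (("%A", pa), ("%B", pb), ("%C", pc)):
    q1, q2, q3 = quantiles(values, n=4)   # quartile cutoffs
    print(f"{name}: skewness {skewness(values):+.2f}, "
          f"quartiles {q1:.1f}, {q2:.1f}, {q3:.1f}")
median_pc = quantiles(pc, n=4)[1]
```

A positive coefficient for %A and %B and a negative one for %C correspond to the tails towards higher and lower values, respectively, described in the text.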
Figure 3: Histograms with A, B, C -distributions: A 0.1-0.3; B 0.2-0.4; C 1-10. Cutoff-values for quartiles of the percentages: % A 2.4, 3.4, 5.4; %B 3.7, 5.3, 8.0; %C 86.7, 91.3, 93.8
A computer analysis verified that, with the current distributions, we have a positive association between %A and %B (Figure 4, left panel); r=0.811, p<0.001, n=200. %A (B) vs. %C: r=-0.931 (-0.969). The major part of the %C-distribution consisted of high values (90 to 96%, Figure 3), however not completely complying with the requirement (100-%C)=0. The slope of the regression line (Figure 4, left panel) was 1.20; equation: %B=1.20 (0.06)*%A+1.25 (0.32).
We next narrow A and B distributions from both ends, i.e. A 0.15-0.18, and B 0.30-0.35, with C-distribution unchanged, 1-10. The %A vs. %B association improved (Figure 4, right panel: r=0.992 p<0.001). The current narrowing of A and B seemed to move the distribution of %C slightly towards higher values (histograms not shown). Cutoff-values for quartiles after narrowing were: %A 2.0, 2.6, 4.0; %B 4.0, 5.2, 8.2; %C 87.6, 92.2, 94.1, as compared with before narrowing: %A 2.4, 3.4, 5.4; %B 3.7, 5.3, 8.0; %C 86.7, 91.3, 93.8. Since the %C -distribution seemed to move towards slightly higher values, we should expect that the association between percentages of A and B improved, as was verified by the scatterplot (Figure 4, right panel); regression line: %B=1.94 (0.02)*%A+0.14 (0.07).
Figure 4: Association between %A and %B; left panel is before narrowing, with A, B, C -distributions: A 0.1 -0.3; B 0.2 -0.4; C 1 -10. Regression line: %B=1.20 (0.06)*%A+1.25 (0.32).
Right panel is after narrowing, with distributions: A 0.15-0.18; B 0.30-0.35; C 1-10.
Regression line: %B=1.94 (0.02)*%A+0.14 (0.07)
Broadening the distribution of A (B) to both sides simultaneously: A broadening of the A (B) distribution should make the %A vs. %B scatterplot poorer, as inferred from the reasoning above. We did a computer experiment with A 0.05 to 0.50, B 0.10 to 0.80, and C unchanged (1.0 to 10.0). The %A vs. %B association was poorer (Figure 5). r=0.331 (p<0.001); regression line %B=0.58 (0.12)*%A+6.42 (0.75).
Figure 5: Association between %A and %B with distributions: A 0.05-0.50; B 0.10-0.80; C 1-10. %A vs. %B, r=0.331 (p<0.001). Regression line: %B=0.58 (0.12)*%A+6.42 (0.75)
Cutoff values for quartiles after broadening the A and B distributions were: %A 2.4, 4.5, 6.6; %B 4.9, 7.3, 11.7; %C 82.3, 88.3, 91.8, as compared with values before broadening: %A 2.4, 3.4, 5.4; %B 3.7, 5.3, 8.0; %C 86.7, 91.3, 93.8. Thus, broadening A and B distributions moved the %C-distribution towards lower values, thereby explaining the poorer %A vs. %B association.
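The narrowing and broadening of the A and B ranges described above can be run side by side. The sketch keeps C at 1-10 and uses the three pairs of A and B ranges from the text; the seed, the sample size of 2000 and the helper names are assumptions.

```python
import random
from statistics import fmean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def r_pct(range_a, range_b, range_c, n, rng):
    """r(%A, %B) for uniform A, B, C with the given ranges."""
    pa, pb = [], []
    for _ in range(n):
        a = rng.uniform(*range_a)
        b = rng.uniform(*range_b)
        c = rng.uniform(*range_c)
        total = a + b + c
        pa.append(100 * a / total)
        pb.append(100 * b / total)
    return pearson(pa, pb)

rng = random.Random(8)  # arbitrary seed (assumption)
C = (1.0, 10.0)
r_narrow = r_pct((0.15, 0.18), (0.30, 0.35), C, 2000, rng)
r_base = r_pct((0.10, 0.30), (0.20, 0.40), C, 2000, rng)
r_broad = r_pct((0.05, 0.50), (0.10, 0.80), C, 2000, rng)
print(f"narrow {r_narrow:.3f} > baseline {r_base:.3f} > broad {r_broad:.3f}")
```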
Narrowing the C-distribution from both sides simultaneously
According to the previous reasoning, and using the equation %B=-%A+(100-%C), there should be a positive association between %A and %B with high %C-values, but the scatterplot for the relationship should deviate increasingly more from a straight line as the expression (100-%C) deviates from zero. We start with a new set of cases with distribution ranges: A 0.10-0.30; B 0.20-0.40; C 1.0-10.0. The distributions of %A and %B were skewed with tails towards high values, and that of %C with a tail towards low values (Figure 6), as expected from the quartile cutoffs and from the reasoning given for Figure 3.
Figure 6: Histogram of %A, %B, and %C, when distributions of A, B, C are: A 0.1 - 0.3; B 0.2 - 0.4; C 1 - 10. Cutoff-values for quartiles of the percentages: % A 2.4, 3.4, 5.4; %B 3.7, 5.3, 8.0; %C 86.7, 91.3, 93.8
Cutoff values for quartiles of the percentages were: %A 2.4, 3.4, 5.4; %B 3.7, 5.3, 8.0; %C 86.7, 91.3, 93.8. We found a significant positive association between %A and %B; r=0.785, p<0.001 (n=200). Regression line: %B=1.07 (0.06)*%A+1.90 (0.31).
We next narrowed the C –distribution from both ends of the distribution, while keeping distributions of A and B; thus, for A 0.10-0.30; for B 0.20-0.40; and for C 4.0-6.0. The distributions of %A, %B, and %C became less skewed, and more bell-shaped (Figure 7).
Figure 7: Histogram of %A, %B, and %C, after narrowing C -distribution, while keeping distributions for A and B, i.e. A 0.1 - 0.3; B 0.2 - 0.4; C 4 - 6. Cutoff-values for quartiles of the percentages: % A 3.0, 3.8, 4.7; %B 4.5, 5.3, 6.5; %C 89.6, 90.8, 91.9, as compared with before narrowing: % A 2.4, 3.4, 5.4; %B 3.7, 5.3, 8.0; %C 86.7, 91.3, 93.8
Cutoff-values for quartiles of the percentages after narrowing C –distribution were: % A 3.0, 3.8, 4.7; %B 4.5, 5.3, 6.5; %C 89.6, 90.8, 91.9 , as compared with corresponding values before narrowing: % A 2.4, 3.4, 5.4; %B 3.7, 5.3, 8.0; %C 86.7, 91.3, 93.8. Thus, the narrowing had caused less of the highest %C -values, and more of the lowest ones, an effect that should make the %A vs. %B scatterplot poorer (Figure 8), as explained above. %B vs. %A, r=0.012 (p=0.863, n=200).
Figure 8: Scatterplot of %A vs. %B after narrowing the C -distribution; A 0.10 - 0.30; B 0.20 - 0.40; and C 4.0 -6.0. %A vs. %B, r=0.012 (p=0.863, n=200)
Broadening the C-distribution to both sides: We broadened the distribution range of C to 0-30, keeping ranges of A and B, i.e. A 0.10-0.30; B 0.20-0.40; C 0-30. The distribution curve of %C became strongly skewed towards high %C-values, with a tail to the left (Figure 9, left panel), giving the expression (100-%C) low values, and thereby favoring a positive %A vs. %B association (Figure 9, right panel). Cutoff values for quartiles after broadening the C-distribution were: %A 0.8, 1.3, 2.3; %B 1.3, 1.9, 4.0; %C 93.9, 96.8, 97.8, as compared with values before broadening the C-distribution: %A 2.4, 3.4, 5.4; %B 3.7, 5.3, 8.0; %C 86.7, 91.3, 93.8. %A vs. %B: r=0.942, p<0.001 (n=200); %A (B) vs. %C: r=-0.977 (-0.992), p<0.001. Regression line: %B=1.58 (0.04)*%A+0.02 (0.20).
Thus, broadening of the C-distribution improved the %A vs. %B association.
Figure 9: Histogram of %C (left panel), and scatterplot of %A vs. %B (right panel). Distributions: A 0.10-0.30; B 0.20-0.40; C 0-30. %A vs. %B, r=0.939 (p<0.001). %A vs. %C, r=-0.980; %B vs. %C r=- 0.989, n=200. Regression line: %B=1.25 (0.03)*%A+0.59 (0.18)
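The effect of the C range alone can be sketched by holding A at 0.10-0.30 and B at 0.20-0.40 and varying only C. The sample size of 20000 is an assumption of this example (the article used 200); it is chosen because the broad C range produces a few very large percentages that make small samples unstable. Seed and helper names are likewise illustrative.

```python
import random
from statistics import fmean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def r_for_c(range_c, n, rng):
    """r(%A, %B) with A 0.10-0.30 and B 0.20-0.40 fixed, C varied."""
    pa, pb = [], []
    for _ in range(n):
        a = rng.uniform(0.10, 0.30)
        b = rng.uniform(0.20, 0.40)
        c = rng.uniform(*range_c)
        total = a + b + c
        pa.append(100 * a / total)
        pb.append(100 * b / total)
    return pearson(pa, pb)

rng = random.Random(9)  # arbitrary seed (assumption)
r_c_narrow = r_for_c((4.0, 6.0), 20000, rng)
r_c_base = r_for_c((1.0, 10.0), 20000, rng)
r_c_broad = r_for_c((0.0, 30.0), 20000, rng)
print(f"C 4-6: {r_c_narrow:.3f}; C 1-10: {r_c_base:.3f}; C 0-30: {r_c_broad:.3f}")
```

The ordering of the three coefficients is the point of the example: the wider the C range, the stronger the positive %A vs. %B association, which is the opposite of what happens when the A and B ranges are widened.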
It might be questioned whether the previous results are influenced by the uniform distribution of the random numbers. We therefore present below some examples based upon random numbers with normal distribution, and using Spearman’s rho to test significance of correlations. We start with normal distribution of two low-number variables, A and B, and one high-number variable (C), as compared with A (B). Previous data of a diet trial [9] were used to find suitable mean (SD) levels. Thus, we generated random numbers using the following mean (SD) values: for A 0.20 (0.03), for B 0.30 (0.04), and for C 8.4 (2.6). Histograms of A, B, and C, and of their percentage amounts are shown in Figure 10. Cutoff-values for quartiles of the percentages were: %A 1.8, 2.2, 2.7; %B 2.7, 3.3, 4.0; %C 93.4, 94.4, 95.4.
Figure 10: Histogram of A, B, C (upper panels) and of percentages of A, B, and C (lower panels). Random numbers with normal distribution were used, and based upon mean value (SD), i.e. for A 0.20(0.03), for B 0.30(0.04), and for C 8.4(2.6). Cutoff-values for quartiles of the percentages: % A 1.8, 2.2, 2.7; %B 2.7, 3.3, 4.0; %C 93.4, 94.4, 95.4
Scatterplot with regression line for the %A vs %B association is shown in Figure 11. Spearman’s correlation for %A vs. %B: rho=0.730 (p<0.001, n=200). Regression line: %B=1.40 (0.05)*%A+0.32 (0.14). Thus, also with normal distribution of two low-number variables, and of one high-number variable, there was a positive association between percentages of the two variables with low-number distribution, in accordance with the previous outcome using random numbers with uniform distribution.
Figure 11: Scatterplot with regression line for the %A vs %B association when random numbers were generated with normal distribution, based upon the following mean (SD) values, i.e. for A 0.20 (0.03), for B 0.30 (0.04), and for C 8.4 (2.6). Regression line: %B=1.40 (0.05)*%A+0.32 (0.14). Spearman’s correlation for %A vs. %B: rho=0.730 (p<0.001, n=200)
We next narrowed the A and B distributions by changing SD, i.e. for A from SD=0.03 to SD=0.01, and for B from SD=0.04 to SD=0.02, while keeping the C distribution. Thus, the conditions (mean, SD) were: A 0.20 (0.01), B 0.30 (0.02), and C 8.4 (2.6). Distributions of A, B, and C, as well as the distributions of their percentages, are shown in Figure 12.
Figure 12: Histograms of A, B, C -distributions (upper panels), and distribution of their percentages (lower panels). The histograms were produced using random numbers with normal distribution, based upon mean value (SD); i.e. for A 0.20 (0.01), for B 0.30 (0.02), and for C 8.4 (2.6). Cutoff-values for quartiles of the percentages: % A 1.8, 2.2, 2.8; %B 2.8, 3.4, 4.1; %C 93.0, 94.4, 95.5, as compared with before narrowing: % A 1.8, 2.2, 2.7; %B 2.7, 3.3, 4.0; %C 93.4, 94.4, 95.4
Spearman’s correlation for the association between %A and %B was rho=0.947 (p<0.001, n=200; Figure 13, left panel). Thus, in spite of rather similar cutoff values for quartiles of %A, %B, and %C, the scatterplot seemed improved, as also indicated by the increase in the correlation coefficient from rho=0.730 to rho=0.947. Accordingly, the narrowing of a normal distribution of A and B improved the association between their relative amounts, as previously observed with random numbers having uniform distribution.
Figure 13: Left panel: Scatterplot with regression line for the %A vs %B association when random numbers were generated with normal distribution, based upon the following mean (SD) values; i.e. for A 0.20 (0.01), for B 0.30 (0.02), and for C 8.4 (2.6). Regression line: %B=1.50(0.02)*%A+0.03(0.07). Spearman’s correlation for %A vs. %B: rho=0.947 (p<0.001, n=200)
Right panel: Scatterplot with regression line for the %A vs. %B association when random numbers are generated with normal distribution and with very low variability of A and B; the following mean (SD) values were, for A 0.20 (0.001), for B 0.30 (0.002), and for C 8.4 (2.6). Rho=0.999, p<0.001, n=200. Regression line: %B=1.50 (0.002)*%A-0.001 (0.007)
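The three normal-distribution conditions (Figure 11 and both panels of Figure 13) can be compared in one small simulation. This is a hedged sketch rather than the original analysis: the helper names and the seed are illustrative, Spearman's rho is computed with the rank-difference formula (which assumes no tied values), and non-positive draws are discarded on the assumption that all variables must be positive.

```python
import random

def spearman_rho(x, y):
    """Spearman's rho via squared rank differences (assumes no ties)."""
    n = len(x)

    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        result = [0] * n
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    rx, ry = ranks(x), ranks(y)
    d2 = sum((p - q) ** 2 for p, q in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def rho_for(sd_a, sd_b, n=200, seed=1):
    """rho(%A, %B) for normal A 0.20 (sd_a), B 0.30 (sd_b), C 8.4 (2.6)."""
    rng = random.Random(seed)
    pct_a, pct_b = [], []
    while len(pct_a) < n:
        a = rng.gauss(0.20, sd_a)
        b = rng.gauss(0.30, sd_b)
        c = rng.gauss(8.4, 2.6)
        if min(a, b, c) <= 0:  # assumption: keep only positive scale values
            continue
        total = a + b + c
        pct_a.append(100 * a / total)
        pct_b.append(100 * b / total)
    return spearman_rho(pct_a, pct_b)

# SDs of A and B: initial, narrowed, and extremely narrowed conditions
conditions = [(0.03, 0.04), (0.01, 0.02), (0.001, 0.002)]
rhos = [rho_for(sd_a, sd_b) for sd_a, sd_b in conditions]
for (sd_a, sd_b), rho in zip(conditions, rhos):
    print(f"SD(A)={sd_a}, SD(B)={sd_b}: rho={rho:.3f}")
```

The same seed is reused for each condition so that only the SDs of A and B change between runs; rho should rise as the A and B distributions narrow.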
We finally made an extreme narrowing of A and B, i.e. for A 0.20 (0.001), and B 0.30 (0.002), with C unchanged, i.e. 8.4 (2.6). The effect of this greatly decreased variability in A (CV=0.5%) and B (CV=0.7%) was to obtain an almost perfect positive %A vs. %B association (rho=0.999, n=200, Figure 13, right panel). In this case the quartiles of A, B, and C percentages were: %A 1.9, 2.3, 2.8; %B 2.9, 3.5, 4.2; %C 93.1, 94.1, 95.2, i.e. not differing much from those before narrowing: % A 1.8, 2.2, 2.7; %B 2.7, 3.3, 4.0; %C 93.4, 94.4, 95.4. It would appear that, at these high levels of %C and low levels of %A and %B, the distribution effect upon the %A vs. %B association is not well explained by the equation %B=-%A+(100-%C). On the other hand, estimation by quartiles might not be sensitive enough to evaluate small changes in distributions.
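One way to read the almost perfect fit in this extreme case is to treat A and B as constants. The short derivation below uses the symbols a and b for these fixed values; they are introduced here for illustration only.

```latex
\%A = \frac{100\,a}{a+b+C}, \qquad
\%B = \frac{100\,b}{a+b+C}
\quad\Longrightarrow\quad
\%B = \frac{b}{a}\,\%A = \frac{0.30}{0.20}\,\%A = 1.5\,\%A
```

All variation in C then moves the points along a line through the origin with slope b/a, which agrees with the reported regression line %B=1.50 (0.002)*%A-0.001 (0.007). The correlation can therefore approach 1 even though the quartiles of %A, %B, and %C hardly change.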
With a large number of variables, the total sum of percentages is 100; %A+%B+%C+ ……=100. In order to find some general rules governing the association between percentages of A and B, we previously simplified, first to involve 3 variables: %A+%B+%C=100, where C could be regarded as the sum of all variables, except A and B. But will the %A vs. %B association be influenced by this simplification? We therefore did a computer experiment with 5 low-number variables, all with SD=10% of the mean value, and one high-number variable (F), with SD=one third of the mean (reason for choosing different variabilities, see above), i.e. A 0.30 (0.03), B 0.20 (0.02), C 0.18 (0.018), D 0.40 (0.04), E 0.15(0.015), and F 9.0(3.0). We calculated the sum, S of all variables A to F, and also R=the remaining sum of all variables, except A and B, i.e. R=S–A-B. Next we calculated the mean (SD) for R=19.6 (5.9). Thus, S=A+B+R, and percentages are: %A=100*A/S; %B=100*B/S; %R=100*R/S. We generated random numbers with normal distribution for 1) all of the original variables A to F, and 2) a new set of random numbers for each of A, B, and R. Correlations between %A and %B before and after simplification were: rho=0.866 and 0.927, respectively (n=200). Corresponding equations for the %A vs. %B regression lines were %B=0.67 (0.02)*%A+0.05 (0.06), and %B=0.64 (0.02)*%A+0.07 (0.03), respectively. It would appear, accordingly, that under the conditions of the current experiment, the “3-variable-simplification” did not have any major effect on the association between percentages of A and B. This outcome seems likely since the same S was used in both calculations, and R includes the total variability of C to F.
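The comparison between the six-variable calculation and the "3-variable-simplification" can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the original procedure: names and seed are hypothetical, non-positive draws are redrawn, and the mean and SD of R are estimated from the simulated sample rather than taken from the text.

```python
import random
import statistics

# Mean (SD) values for A to F as given in the text
MEANS_SDS = {"A": (0.30, 0.03), "B": (0.20, 0.02), "C": (0.18, 0.018),
             "D": (0.40, 0.04), "E": (0.15, 0.015), "F": (9.0, 3.0)}

def positive_gauss(rng, mean, sd):
    """Normal draw, redrawn until positive (assumption: variables must be > 0)."""
    while True:
        value = rng.gauss(mean, sd)
        if value > 0:
            return value

def spearman_rho(x, y):
    """Spearman's rho via squared rank differences (assumes no ties)."""
    n = len(x)

    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        result = [0] * n
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    d2 = sum((p - q) ** 2 for p, q in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

rng = random.Random(1)
n = 200

# 1) Full model: %A and %B of the sum S of all six variables
full = [{k: positive_gauss(rng, m, s) for k, (m, s) in MEANS_SDS.items()}
        for _ in range(n)]
S = [sum(row.values()) for row in full]
R = [s - row["A"] - row["B"] for s, row in zip(S, full)]
pct_a_full = [100 * row["A"] / s for row, s in zip(full, S)]
pct_b_full = [100 * row["B"] / s for row, s in zip(full, S)]

# 2) Simplified model: fresh draws of A, B and one remainder variable R,
#    with the mean and SD of R estimated from the full model above
r_mean, r_sd = statistics.fmean(R), statistics.stdev(R)
pct_a_simple, pct_b_simple = [], []
for _ in range(n):
    a = positive_gauss(rng, *MEANS_SDS["A"])
    b = positive_gauss(rng, *MEANS_SDS["B"])
    r = positive_gauss(rng, r_mean, r_sd)
    pct_a_simple.append(100 * a / (a + b + r))
    pct_b_simple.append(100 * b / (a + b + r))

rho_full = spearman_rho(pct_a_full, pct_b_full)
rho_simple = spearman_rho(pct_a_simple, pct_b_simple)
print(f"rho, six variables: {rho_full:.3f}")
print(f"rho, three variables: {rho_simple:.3f}")
```

If the simplification is harmless, the two rho values should be of similar size and both clearly positive.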
When considering the presented calculations we may think of a cake that is divided into three pieces, each representing percentage contributions of A, B, and C. There are many ways to divide the cake into 3 pieces. However, as explained above, we should expect a positive association between percentages of two low-number variables A and B, both with a low variability, and one high-number variable (C) with a broader distribution. On the other hand, with two large -number variables, A and B, and one low-number variable (C) we should expect a negative association between %A and %B. This outcome may be understood also by a geometric reasoning, as presented below.
We name the 2 low-number variables, A and B, and the one with higher numbers C; thus % A (piece A)+% B (piece B)+%C (piece C)=The whole cake (Figure 14). The size of each of the 3 pieces should vary, but only within certain boundaries, determined by the distribution (range, variability) of A, B, and C.
Figure 14: Pie chart to illustrate percentages of two low-number variables (A with distribution 0.10 to 0.15, in blue, and B with distribution 0.18 to 0.25, in red) and one high-number variable (C, in green; distribution 1 to 10)
If piece C progressively increases from its lowest to its highest value, then pieces A and B will progressively decrease in order to compensate for the expansion of piece C. Since the sum of the pieces is always 100%, piece C will be negatively correlated with each of pieces A and B. Furthermore, since pieces A and B both decrease as piece C increases, there will be a positive correlation between pieces A and B. Moreover, in response to progressively extending the C-distribution, pieces A and B will be increasingly compressed, thereby improving the positive association between them. Indeed, if the C-expansion is very large (close to 100%), then each of pieces A and B would approach zero, and their association would approach a straight line with scatterplot points close to the regression line. Conversely, a narrowing of the C-distribution will decrease piece C, increase pieces A and B, and cause a poorer scatter for the %A vs. %B association. Additionally, a narrowing of the distributions of A, B, or both will increase %C, decrease %A and %B, and improve the association between them. The opposite will happen when broadening the A- or B-distributions. Choosing distribution 0.10 to 0.15 for A, 0.18 to 0.25 for B, and 1 to 10 for C, a computer analysis showed the following Pearson r-values (n=200): %A vs. %B, r=0.967 (p<0.001); %A vs. %C, r=-0.987 (p<0.001); %B vs. %C, r=-0.996 (p<0.001). We also examined the effect upon the %A vs. %B association of narrowing and broadening the C-distribution; the changes were as predicted above. This geometric reasoning may be used to explain our previously reported positive correlations between percentages of arachidonic acid and each of the n3 fatty acids EPA, DPA, and DHA [9,10].
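The predicted effect of narrowing and broadening the C-distribution can be checked with a small sweep. This is a sketch under assumptions: the narrowed (4 to 6) and broadened (1 to 50) C-ranges are illustrative choices that do not appear in the text, as are the function names and the seed; only the ranges of A and B and the 1 to 10 range of C are taken from the example above.

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def r_pct_a_vs_b(c_low, c_high, n=200, seed=1):
    """Pearson r between %A and %B for uniform A, B and C in [c_low, c_high]."""
    rng = random.Random(seed)
    pct_a, pct_b = [], []
    for _ in range(n):
        a = rng.uniform(0.10, 0.15)
        b = rng.uniform(0.18, 0.25)
        c = rng.uniform(c_low, c_high)
        total = a + b + c
        pct_a.append(100 * a / total)
        pct_b.append(100 * b / total)
    return pearson_r(pct_a, pct_b)

# Narrowed (assumed), original, and broadened (assumed) C-ranges
c_ranges = [(4.0, 6.0), (1.0, 10.0), (1.0, 50.0)]
results = {c_range: r_pct_a_vs_b(*c_range) for c_range in c_ranges}
for (c_low, c_high), r in results.items():
    print(f"C {c_low}-{c_high}: r(%A, %B) = {r:.3f}")
```

With a narrow C-range the variability of A and B themselves dominates and the correlation weakens; with a broad C-range the shared denominator dominates and the correlation strengthens.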
Figure 15: Pie chart to illustrate percentages of two “high-number variables” (A with distribution 1 to 7, in blue, and B with distribution 1 to 10, in red), and one “low-number variable” (C in green, with distribution 0.1 to 0.3)
In this case (Figure 15), we would expect a negative association between piece A and B, since an increase in piece A (B) from lowest to highest value within a relatively broad distribution must be compensated for by a corresponding decrease in piece B (A), and possibly by a minor decrease in piece C. However, there should be no strong correlation between %C and %A (B) since even a major decrease in piece C would be insufficient to compensate for a progressive increase in piece A or B from their lowest to highest values. Choosing distribution 0.10-0.30 for C, 1-7 for A, and 1-10 for B, a computer analysis showed the following Pearson r-values (n=200): %A vs. %B; r=-0.998(p<0.001); %A vs. %C; r=0.140 (p=0.048); %B vs. %C; r=-0.203 (p=0.004).
A special case is that of 3 equal pieces (equal distributions of all pieces, n=200): A correlates negatively with B (r=-0.572, p<0.001) and with C (r=-0.423, p<0.001), and B also correlates negatively with C (r=-0.498, p<0.001).
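The two cases just described (two high-number variables with one low-number variable, and three equal pieces) can be simulated in the same way. The sketch below is illustrative only: the common range 1 to 10 for the equal-pieces case is an assumption, since the text does not state which range was used, and the function names and seed are hypothetical.

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def pct_correlations(ranges, n=200, seed=1):
    """Return (r_AB, r_AC, r_BC) between percentages of uniform A, B, C."""
    rng = random.Random(seed)
    cols = ([], [], [])
    for _ in range(n):
        values = [rng.uniform(lo, hi) for lo, hi in ranges]
        total = sum(values)
        for col, value in zip(cols, values):
            col.append(100 * value / total)
    pct_a, pct_b, pct_c = cols
    return (pearson_r(pct_a, pct_b),
            pearson_r(pct_a, pct_c),
            pearson_r(pct_b, pct_c))

# Figure 15 conditions: A 1-7, B 1-10, C 0.10-0.30
high_high_low = pct_correlations([(1, 7), (1, 10), (0.10, 0.30)])
# Equal pieces: same range for all three (1-10 is an assumed range)
equal_pieces = pct_correlations([(1, 10), (1, 10), (1, 10)])

for label, rs in (("A, B high; C low", high_high_low), ("equal pieces", equal_pieces)):
    print(label, "r_AB=%.3f r_AC=%.3f r_BC=%.3f" % rs)
```

In the first case %A and %B should be almost perfectly negatively correlated; in the second, all three pairwise correlations should be moderately negative.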
Thus, we may get an idea of how %A and %B will be related by considering the equation %B=-%A+(100-%C), and by knowing the distributions of A, B, C, and of their percentages. We may also conceive how distributions per se govern the association between percentages of A, B, and C by geometric reasoning.
It is not surprising that percentages may be correlated if they are computed from the same sum. Indeed, as early as 1897 Karl Pearson [11] reported that there will be a spurious correlation between two indexes with the same denominator, even if the variables used to produce the indexes are selected at random with no correlation between them. However, the present analyses with random numbers show that significant correlations between percentages of the same sum are not always obtained, and that correlations may be positive or negative depending on the distribution (range, variability) of the variables. That distribution is crucial for the outcome was demonstrated by the appreciable changes in the presented scatterplots, caused by changes in the distribution of the variables involved. Our previous [9,10] and present calculations with random numbers raise the general question of whether the concentration distribution per se, of for example fatty acids, governs whether their relative amounts are positively associated, negatively associated, or not related at all. It is tempting to speculate whether the mathematical rules governing Distribution dependent correlations might have a general relevance when studying the association between relative abundances of variables in biology, nutrition, physics, chemistry, and the social sciences. Thus, if we know ranges and variabilities, we may possibly predict whether relative abundances are positively associated, negatively associated, or not associated at all.
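For reference, Pearson's first-order approximation for the correlation between two indexes X/Z and Y/Z sharing a denominator can be written as below. The notation is introduced here for illustration (v denotes a coefficient of variation, i.e. SD divided by mean), and the expression assumes that X, Y, and Z are mutually uncorrelated.

```latex
r_{X/Z,\;Y/Z} \;\approx\;
\frac{v_Z^{2}}{\sqrt{\left(v_X^{2}+v_Z^{2}\right)\left(v_Y^{2}+v_Z^{2}\right)}}
```

Read with X=A, Y=B, and Z taken as the sum, and ignoring that the sum itself contains A and B (a simplification that is only reasonable when A and B are small relative to C), the expression suggests why low variability of A and B combined with a broad C-distribution pushes the %A vs. %B correlation towards 1. As a rough check, with mean (SD) values of 0.20 (0.001) for A, 0.30 (0.002) for B, and 8.4 (2.6) for C, and taking v_Z as 2.6/8.4, the approximation gives about 0.9996, close to the rho=0.999 obtained in the corresponding computer experiment. The approximation does not cover the case of two high-number variables, where A and B make up most of the sum.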
The present computer experiments involve only some examples of many possible distributions (ranges) of scale variables. It would seem of interest to include other distributions as well, preferably those encountered in physiology and pathology. Although the mathematical rules governing Distribution dependent correlations seem reasonably well accounted for in the present work, we do not know the possible physiological applicability of the results, e.g. as related to the distribution of variables in organs, tissues, cell compartments, and in various species, including man. Future work in this field should include studies to explore whether (to what extent) Distribution dependent correlations really are used as a physiological regulatory mechanism. Furthermore, more general mathematical models should be developed, suitable to predict positive (negative) correlations between percentages. Such rules should also serve to define the detailed requirements needed to obtain the above-mentioned “turning-points”.
The present work seems to provide some general rules for Distribution dependent correlations:
1) With 3 positive scale variables, two of which have a low-number distribution and low variability as compared with the third variable, we might expect a positive association between percentages of the low-number variables, and a negative association between the percentage of the high-number variable and the percentage of each of the low-number variables.
2) A decrease (increase) in the variability of either or both of the two low-number variables will improve (make poorer) the association between their relative amounts. In contrast, a narrowing (broadening) of the distribution of the high-number variable will make poorer (improve) the association between percentages of the low-number variables.
3) With 3 positive scale variables, two of which have a high-number distribution as compared with the third variable, we might expect a strong negative association between percentages of the two high-number variables, and probably a poor correlation between the percentage of the low-number variable and the percentage of each of the high-number variables.
4) Distributions of 3 scale variables can be manipulated so as to obtain a Turning Point, i.e. a situation where a positive (negative) association between percentages of two of them turns to become negative (positive).
5) With more than 3 variables, we may calculate the sum of all, minus the two under investigation; the preceding rules may apply with this “3-variable-modification”.
6) Distribution dependent correlations may be used as a regulatory mechanism in physiology.
- 1. Mayes PA (2000) Metabolism of unsaturated fatty acids and eicosanoids. In: Harper’s Illustrated Biochemistry, McGraw-Hill, New York, USA.
- 2. Smith WL, Murphy RC (2008) The eicosanoids: cyclooxygenase, lipoxygenase and epoxygenase pathways. In: Biochemistry of Lipids, Elsevier, UK.
- 3. Gogus U, Smith C (2010) n-3 Omega fatty acids: a review of current knowledge. Int J Food Sci Tech 45: 417-36.
- 4. Alvheim AR, Malde MK, Osei-Hyiaman D, Lin YH, Pawlosky RJ, et al. (2012) Dietary linoleic acid elevates endogenous 2-AG and anandamide and induces obesity. Obesity 20: 1984-94.
- 5. Kremer JM, Bigauoette J, Lininger L, Huyck C, Michalek AV, et al. (1985) Effects of manipulation of dietary fatty acids on manifestations of rheumatoid arthritis. Lancet 325: 184-7.
- 6. Lorenz R, Weber PC, Szimnau P, Heldwein W, Strasser T, et al. (1989) Supplementation with n-3 fatty acids from fish oil in chronic inflammatory bowel disease - a randomized, placebo-controlled, double-blind cross-over trial. J Intern Med Suppl 731: 225-32.
- 7. Kromhout D, Bosschieter E, Coulander CL (1985) The inverse relation between fish consumption and 20-year mortality from coronary heart disease. N Engl J Med 312: 1205-9.
- 8. Nagata M, Hata J, Hirakawa Y, Mukai N, Yoshida D, et al. (2017) The ratio of serum eicosapentaenoic acid to arachidonic acid and risk of cancer death in a Japanese community: The Hisayama Study. J Epidemiol 27: 578-83.
- 9. Høstmark AT, Haug A (2018) The Fatty Acid Distribution per se Explains Why Percentages of Eicosapentaenoic Acid (20:5 n3) and Arachidonic Acid (20:4 n6) are Positively Associated; a Novel Regulatory Mechanism? J Nutr Diet Suppl 2: 1-10.
- 10. Høstmark AT, Haug A (2019) Associations Between %AA (20:4 n6) and Percentages of EPA (20:5 n3), DPA (22:5 n3), and DHA (22:6 n3) Are Distribution Dependent in Breast Muscle Lipids of Chickens. J Nutr Diet Suppl 3: 1-9.
- 11. Pearson K (1897) Mathematical contributions to the theory of evolution. On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc R Soc Lond 60: 489-96.