5 changed files with 20 additions and 41 deletions
--- a/paper/agujournaltemplate.tex
+++ b/paper/agujournaltemplate.tex
--- a/paper/coverletter.tex
+++ b/paper/coverletter.tex
@@ -6,7 +6,6 @@
 \newcommand{\markme}[1]{#1}
 \newcommand{\ac}{a_{\textrm{col,p-p}}}
 \newcommand{\Tp}{T_{\textrm{p}}}
-\newcommand{\nO}{n_{O^{7+}}/n_{O^{6+}}}
 \newcommand{\verena}[1]{\textcolor{blue}{#1}}
 \newcommand{\sophie}[1]{\textcolor{green}{#1}}
 \newcommand{\reply}[1]{\textbf{Reply:} #1 \newline
@@ -385,7 +384,7 @@ The study is potentially interesting but, in my opinion, the methodology, result

 \reply{Thank you for your critical comments. We have extensively revised the manuscript to improve the readability and to sharpen the discussion. In particular we moved most of the detailed explanation of the results into the appendix and instead replaced Section 4 with a short summary of the most relevant results.}

-\commentA{I have major issues with the methodology that seems to follow a brute force strategy. Although the authors claim to not use any ground-truth for solar wind classification, they eventually use their 7-parameters classification as ground truth (or a statistical average of an ensemble of k-means runs). Even assuming that that is a sensible choice, they end up running thousands of k-means runs (with and without noise, etc.) In my opinion the whole problem could have been more elegantly solved by using sensitivity analysis tool (the sensitivity of each individual parameter is, at the end of the day, what the authors are looking for). }
+\commentA{I have major issues with the methodology that seems to follow a brute force strategy. Although the authors claim to not use any ground-truth for solar wind classification, they eventually use their 7-parameters classification as ground truth (or a statistical average of an ensemble of k-means runs). Even assuming that that is a sensible choice, they end up running thousands of k-means runs (with and without noise, etc.) In my opinion the whole problem could have been more elegantly solved by using sensitivity analysis tool (the sensitivity of each individual parameter is, at the end of the day, what the authors are looking for). Moreover, once the centroids of the "ground-truth" have been established, it would be relatively simple to look (analytically) at the relative importance of each parameters, since k-means uses a simple Eucledian distance.}

 \reply{

@@ -412,10 +411,9 @@ The study is potentially interesting but, in my opinion, the methodology, result
     Thus, we can replace one part of our approach, the Monte Carlo simulations for each input parameter combination, with a standard sensitivity analysis tool. However, this would not decrease but rather increase the computational effort of our study without gaining much additional information. The core part of our study, the importance of each input parameter for solar wind classification, cannot be addressed by sensitivity analysis.
     \item Feature importance and explainability methods \citep{solorio2020review,kumar2014feature,alelyani2018feature}. The third research area related to our approach is feature importance (which aims at evaluating the importance of individual features for the final result without necessarily eliminating the features) and explainability. In particular, explainability is a rather young research field that has gained importance with the rise of successful deep learning applications. As a result, these methods are usually tailored to considerably more complex cases than $k$-means and focus on large numbers of features. Thus, again, many approaches focus on explaining individual features (for example, \citet{zhuang2019decision}, or with SHAP values \citet{kumar2020problems, pudil1995feature}). Similar to embedded or search-based feature selection methods, the available explainability approaches attempt to deal with the situation where wrapper-like approaches are not feasible. Thus, they provide explainability only for a subset of features. This again is not helpful for our purposes. These methods tend to be less informative than our wrapper-like approach and have a higher computational cost.
   \end{itemize}
-   
  %https://salib.readthedocs.io/en/latest/index.html
-\verena{todo: We added a section discussing alternative approaches to the introduction. add \url{https://arxiv.org/pdf/2105.08053.pdf}}
-}
+\verena{todo: We added a section discussing alternative approaches to the introduction.}
+

 %  example with ACO: \citet{al2005feature}.
 %  example for feature importance (single features) with good comparison \citet{saarela2021comparison}
@@ -430,49 +428,30 @@ The study is potentially interesting but, in my opinion, the methodology, result
 % using unsupervised clustering as feature selection: \citet{mitra2002unsupervised}
 % promising reviews:

- \commentA{Moreover, once the centroids of the "ground-truth" have been established, it would be relatively simple to look (analytically) at the relative importance of each parameters, since k-means uses a simple Eucledian distance.}
+\verena{todo: explainable AI for $k$-means: \citet{frost2020exkmc}}

- \reply{Indeed, the final cluster centers allow for a simple description of the respective clustering. 
-However, the approach suggested here would again only answer the second part of our requirements of our approach (how do uncertainties in the input parameters affect the results) and not the first part (how important is it for the final classification to include specific parameters and combinations of parameters in the input parameter set). In addition, formally the cluster centers determined by $k$-means depend directly on all 238047 data points in the data set and their individual uncertainties.  
+  %https://books.google.de/books?hl=de&lr=&id=jBm3DwAAQBAJ&oi=fnd&pg=PP1&ots=EgzU-jEIV-&sig=NyF9KNZrJKZpeeVijWtvReoOAOQ#v=onepage&q&f=false
+
+%  very different application but interesting: \citet{lee2022comparison}
+
+\verena{todo: reply to this part `` Moreover, once the centroids of the "ground-truth" have been established, it would be relatively simple to look (analytically) at the relative importance of each parameters, since k-means uses a simple Eucledian distance.'':
+Such an approach would again only address the second of our two requirements (how do uncertainties in the input parameters affect the results?) and not the first (how important is it for the final classification to include specific parameters and combinations of parameters in the input parameter set?).
+}
 }

 \commentA{Instead, the authors choose different comparison criteria, which are not very well introduced or explained, and end up giving contradictory results.}
-We consider two types of criteria: (1) criteria to evaluate the quality of the reference clusterings (mean inner cluster distance, the Calinski-Harabasz score and the Davies-Bouldin score) and (2) similarity measures to compare different clusterings (the Fowlkes-Mallows score, the adjusted Rand score, and the normalized and adjusted mutual information scores \verena{todo: add citations for the scores}). All our chosen criteria are frequently used in the machine learning literature \verena{todo: add citations here}. \verena{todo: } For the sake of clarity we extended their respective descriptions in the manuscript. A priori, there is no reason why any of these criteria should be more appropriate for our solar wind clustering application than any of the others. Therefore, to avoid bias, we included all of them. Since the two mutual information scores rarely deviate from each other, we removed one of them and moved the detailed description of the differences in the results between the different similarity scores to the new appendix. The appendix also contains the definitions and a short discussion of the expected differences between the respective scores. \verena{That was what we had agreed on, right?}
+\verena{todo: explain our four similarity criteria}

 \subsection*{Minor comments:}
 \commentA{The English needs to be polished and some sentences/concepts are repeated too many times. E.g. line 35: cancel "previously mentioned"; line 42: hole $\longrightarrow$ holes; line 71: remain $\longrightarrow$ remains; line 72: is $\longrightarrow$ are; line 108: need $\longrightarrow$ needs, etc.}

-\reply{Thank you for your suggestions! We corrected these cases and carefully revised the manuscript to polish and improve the language quality. We also removed repetitions where possible.}
+\reply{Thank you for your suggestions! We corrected these cases and carefully revised the manuscript to polish and improve the language quality. We also removed repetitions where possible. \verena{Line 42 was already correct. In line 71, ``remain'' is correct; I adjusted the other two.}}

-  \commentB{Line 51 and elsewhere}{use consistently element symbol or word.}
+Line 51 and elsewhere: use consistently element symbol or word.
+Line 164: Bloch et al. did not use k-means but Bayesian Gaussian Mixture;

-  \reply{Thank you for this comment. We adopted the following strategy:
-    In the abstract and plain language summary only, we use the expression ``the ratio between the densities of $O^{6+}$ and $O^{7+}$''. In Section 1, we define the term ``Oxygen charge state ratio'' and the corresponding symbol $\nO$. In the following, we now always use the symbol. \verena{apply and check this!}}
+Eq.1 : write units in parenthesis, after formula

-    \commentB{Line 164}{ Bloch et al. did not use k-means but Bayesian Gaussian Mixture;}
-
-    \reply{We removed the \cite{bloch2020data} reference from this sentence and line 281.
-
-      We interpreted the following remark from Section 5 of \cite{bloch2020data} to mean that they applied both Bayesian Gaussian Mixture (which can be regarded as a generalization of $k$-means) and $k$-means (although the results of this comparison are not shown in the paper):}
-
-      \begin{quote}
To test the validity of the above arguments against using k-means, we have investigated
how the results differ from the BGM scheme. Overall, the results from k-means are qualitatively
the same as those from the BGM (i.e. the majority of data are assigned the same class
in both schemes), but with drawbacks. Such drawbacks include an apparent increase in the
mis-classification of Ulysses CHW data, and incongruent speed distributions for the unclassified
data between Ulysses and ACE. These differences are due to the comparatively poor
way of determining classification boundaries, and the changes in the objective functions being
optimised. These differences both highlight that k-means is less suited to classification
in the way we have applied the BGM.
-      \end{quote}
-
-
-    \commentB{Eq.1}{write units in parenthesis, after formula}
-
-    \reply{ For the sake of consistency we added the unit after the formula as \textit{in $ \frac{K^{3/2} s^2}{{cm}^{-3} km}$}.}
-
-    
 Line 239 and ff: are the different measurement uncertainties considered independent? This should be explained

 Figure 1: I do not see any error bar
--- a/paper/introduction_agu.tex
+++ b/paper/introduction_agu.tex
@@ -140,7 +140,7 @@ combinations that need to be analyzed and compared. We restricted
 this study to this selection of solar wind parameters, to keep the
 computational cost manageable and - more importantly - to allow for
 a detailed and concise discussion of the results.  As in
-\citeA{heidrich2018solar}\markme{\cancel{\citeA{bloch2020data}}} we chose $k$-means as our
+\citeA{heidrich2018solar, bloch2020data} we chose $k$-means as our
 unsupervised machine learning method. Our approach allows to address
 several aspects: (1) Investigate whether transport affected solar wind
 properties are sufficient to identify the solar source region or
--- a/paper/methods_agu.tex
+++ b/paper/methods_agu.tex
@@ -41,7 +41,7 @@ contains \delete{with the} $\nOsix$ the most frequent solar wind ion (heavier th

   Furthermore, we also consider the proton-proton collisional age ($\colage$, also called Coulomb number), which estimates the number of $90^{\degree}$-equivalent proton-proton collisions in the plasma, as an additional parameter. As shown in \citeA{kasper2012evolution,heidrich2020proton}, the proton-proton collisional age (in the following also referred to as collisional age) is a suitable ordering parameter that summarizes the collisional transport history of the solar wind. We compute the collisional age in the same way as in \citeA{heidrich2020proton}:
   \begin{equation}
-   \colage = \bf 6.4 \times 10^{8}\frac{\n}{\vsw \T^{3/2}} \qquad \textrm{in} \qquad \frac{K^{3/2} s^2}{{cm}^{-3} km}
+   \colage = \frac{6.4 \times 10^{8}\,\mathrm{K}^{3/2}\,\mathrm{s}^2}{\mathrm{cm}^{-3}\,\mathrm{km}}\frac{\n}{\vsw \T^{3/2}}
    \label{eq:colage}
    \end{equation}

@@ -90,7 +90,7 @@ contains \delete{with the} $\nOsix$ the most frequent solar wind ion (heavier th

 
   \subsection{$k$-means}\label{sec:kmeans}
-   $k$-means \cite{lloyd1982least} is a simple algorithm to perform clustering, which is, unsupervised classification. The $k$-means algorithm starts by choosing an initial guess as the first cluster centres. In the next step, every point in the dataset is assigned to one of the $k$ clusters by computing the Euclidean distance to all cluster centres. The closest cluster centre determines the cluster assignment. After this step, new cluster centres are obtained by calculating the mean over each current cluster. The last two steps are iterated until the cluster centres converge. The algorithm was implemented using python version 3.9.2 and scikit-learn version 1.0.2 \cite{scikit-learn}. To determine the initial cluster centres, \texttt{kmeans++} was used. \texttt{kmeans++} is an extension to $k$-means wherein the starting points are not randomly chosen but determined with an underlying algorithm to improve the convergence speed of $k$-means.  A similar analysis as presented in this study applied to  other clustering methods could lead to different, potentially complementing results.  $k$-means implicitly makes the assumption that clusters are normal distributed - and therefore convex - and well separated. Since solar wind times sometimes show a continuous transition in the solar wind properties from one type solar wind type to another, these solar wind types cannot be expected to be very well separated and, as discussed Sec.~\ref{sec:elbow}, cluster cannot be assumed to be convex. Nevertheless, previous work \cite{heidrich2018solar,amaya2020visualizing} \markme{\cancel{\citeA{bloch2020data}}} has shown that $k$-means is a reasonable choice for the classification of solar wind. We chose $k$-means as our test case for solar wind classification for several reasons: $k$-means is a simple purely data driven method that is reasonably robust even if the underlying assumptions are stretched. 
Many well-tested implementations are available, which simplifies to reproduce our approach. We can build on available previous studies  \cite{heidrich2018solar,bloch2020data,amaya2020visualizing} that have already explored  the resulting clusterings which allows us to focus on our research question, that is, how relevant different solar wind parameters are for solar wind classification.
+   $k$-means \cite{lloyd1982least} is a simple algorithm to perform clustering, that is, unsupervised classification. The $k$-means algorithm starts by choosing an initial guess for the cluster centres. In the next step, every point in the dataset is assigned to one of the $k$ clusters by computing the Euclidean distance to all cluster centres. The closest cluster centre determines the cluster assignment. After this step, new cluster centres are obtained by calculating the mean over each current cluster. The last two steps are iterated until the cluster centres converge. The algorithm was implemented using Python version 3.9.2 and scikit-learn version 1.0.2 \cite{scikit-learn}. To determine the initial cluster centres, \texttt{kmeans++} was used. \texttt{kmeans++} is an extension to $k$-means wherein the starting points are not chosen uniformly at random but by a seeding procedure that favours points far from the centres already chosen, which improves the convergence speed of $k$-means. An analysis similar to the one presented in this study, applied to other clustering methods, could lead to different, potentially complementary results. $k$-means implicitly assumes that clusters are normally distributed (and therefore convex) and well separated. Since the solar wind sometimes shows a continuous transition in its properties from one solar wind type to another, these solar wind types cannot be expected to be very well separated and, as discussed in Sec.~\ref{sec:elbow}, clusters cannot be assumed to be convex. Nevertheless, previous work \cite{heidrich2018solar,bloch2020data,amaya2020visualizing} has shown that $k$-means is a reasonable choice for the classification of solar wind. We chose $k$-means as our test case for solar wind classification for several reasons: $k$-means is a simple, purely data-driven method that is reasonably robust even if the underlying assumptions are stretched. Many well-tested implementations are available, which simplifies reproducing our approach.
We can build on previous studies \cite{heidrich2018solar,bloch2020data,amaya2020visualizing} that have already explored the resulting clusterings, which allows us to focus on our research question, that is, how relevant different solar wind parameters are for solar wind classification.

   To mitigate the sensitivity of $k$-means to outliers, the data set is scaled using \texttt{scikit-learn}'s \texttt{RobustScaler}. After an initial study showed that, for solar wind classification, the resulting clusterings are very stable against changes in the hyperparameter settings, $k$-means is initialized with the following hyperparameters: \texttt{n\_init}: $10$, \texttt{tol}: $0.001$, \texttt{algorithm}: \emph{auto}, and \texttt{max\_iter}: $1000$.
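As an illustration of the iteration described in the $k$-means subsection (assign each point to the nearest centre by Euclidean distance, recompute each centre as the mean of its cluster, repeat until the centres stop moving), a minimal pure-Python sketch might look as follows. It is not the scikit-learn implementation used in the manuscript. The function name lloyd_kmeans, the caller-supplied initial centres (instead of kmeans++), the absolute convergence threshold, and the handling of empty clusters are all assumptions made for this example; the scaling step is omitted.

```python
import math


def lloyd_kmeans(points, initial_centres, tol=1e-3, max_iter=1000):
    """Plain Lloyd iteration (hypothetical helper, not scikit-learn's KMeans).

    points          -- list of equal-length coordinate tuples
    initial_centres -- list of k starting centres (assumption: given by the
                       caller; the manuscript uses kmeans++ seeding instead)
    tol             -- assumption: stop when no centre moves farther than this
    """
    centres = [list(c) for c in initial_centres]
    k = len(centres)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: the closest centre (Euclidean distance) wins.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Update step: each centre becomes the mean of its assigned points.
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centres.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # Assumption: an empty cluster keeps its previous centre.
                new_centres.append(centres[c])
        # Convergence check on the largest centre displacement.
        shift = max(math.dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift <= tol:
            break
    return centres, labels
```

Calling lloyd_kmeans(points, [(0.0, 0.0), (10.0, 10.0)]) on two well-separated groups of points returns one centre per group together with the per-point cluster labels.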
   
--- a/paper/n_solar_wind_agu.tex
+++ b/paper/n_solar_wind_agu.tex
@@ -19,7 +19,7 @@ In $k$-means applied to solar wind classification the pre-chosen number of clust
 Even though the computations are performed for every number of clusters from two to twelve, in the following we focus on two cases: three and seven clusters. Based on Fig.~\ref{fig:elbow}, we chose $k=3$ clusters as an interesting case since this corresponds to the maximum of the Calinski-Harabasz score and a minimum of the Davies-Bouldin score. This indicates that the resulting clusters are convex. In addition, $k=3$ allows a direct comparison to the \citeA{xu2014new} classification. We chose $k=7$ as a representative for the elbow in the MICD in Fig.~\ref{fig:elbow}. We also analyzed the results of all clusterings with $k=2,\dots,13$ to ensure that the qualitative results do not depend on the particular choice of $k$.

 \subsection{Reference solar wind clustering based on all input parameters}\label{sec:all_para}
-The focus of this study lies not on analyzing the solar wind types produced by $k$-means (\markme{or other unsupervised clustering methods, }for this question we refer to the literature \cite{amaya2020visualizing,heidrich2018solar,bloch2020data}). Nevertheless, in this section we provide a tentative interpretation of our reference clustering.
+The focus of this study lies not on analyzing the solar wind types produced by $k$-means (for this question we refer to the literature \cite{amaya2020visualizing,heidrich2018solar,bloch2020data}). Nevertheless, in this section we provide a tentative interpretation of our reference clustering.
 For this purpose, and to ensure a meaningful comparison in the following section, we here describe the solar wind clustering obtained by $k$-means based on the full parameter combination for three and seven clusters. As described in Sect.~\ref{sec:method}, we estimate the stability of $k$-means by retraining 100 independent trials. We compute the similarity scores between the clustering from each trial and the arbitrarily chosen first trial. For three clusters, the respective medians are: adjusted Rand score $0.992$ ($0$th to $100$th percentiles: $[0.981, 0.999]$), adjusted mutual information score $0.981$ ($[0.964, 0.998]$), normalized mutual information score $0.981$ ($[0.964, 0.998]$) and Fowlkes-Mallows score $0.995$ ($[0.989, 0.999]$). For seven clusters, the similarities between the trials are: adjusted Rand score $0.877$ ($0$th to $100$th percentiles: $[0.740, 0.992]$), adjusted mutual information score $0.877$ ($[0.766, 0.996]$), normalized mutual information score $0.876$ ($[0.766, 0.986]$) and Fowlkes-Mallows score $0.901$ ($[0.791, 0.994]$). As expected, the similarity between the individual trials is smaller for seven clusters than for three clusters, which is also reflected in the noticeably larger confidence intervals (depicted with error bars) in Fig.~\ref{fig:big_7_mc} than in Fig.~\ref{fig:big_3_mc}.
 
 Since all trials are very similar to each other in the three-cluster case, we focus on the results of the first trial in the following. We represent each cluster by one-dimensional projections onto each input parameter, which results in one distribution per cluster and input parameter. The different distributions are shown in Fig.~\ref{fig:distribution}. For the case of three clusters, we can identify Cluster 1 as typical slow solar wind, Cluster 2 as typical coronal hole wind and Cluster 3 as plasma from compression regions. The compression region cluster is identified by a high proton density and a high collisional age and contains the smallest number of data points. As expected, the coronal hole wind cluster exhibits high proton speed, low proton density and high proton temperature. The slow solar wind cluster contains the highest number of data points and is characterized by its high (but lower than in the compression region cluster) proton density, low proton temperature and a higher $\nO$ compared to the coronal hole wind cluster.
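
The stability estimate in this subsection compares the labels of each trial with those of the first trial through similarity scores such as the adjusted Rand score. As a rough illustration of what such a score measures, the following pure-Python sketch computes the adjusted Rand index from the contingency counts of two labelings. The function name adjusted_rand and the convention of returning 1.0 for degenerate inputs are assumptions of this example; the manuscript itself relies on the scikit-learn metrics.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index of two labelings (hypothetical helper)."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both labelings must cover the same data points")
    if n < 2:
        return 1.0  # assumption: trivial input counts as perfect agreement
    # Pairs of points that share a cluster in both labelings.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs of points that share a cluster in each labeling separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Agreement expected by chance, and the largest attainable agreement.
    expected = together_a * together_b / comb(n, 2)
    maximum = 0.5 * (together_a + together_b)
    if maximum == expected:
        return 1.0  # assumption: both labelings are trivial
    return (together_both - expected) / (maximum - expected)
```

The score does not depend on how the clusters are numbered, which matters here because independent $k$-means trials may return the same partition with permuted cluster labels.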