This is a summary, written by members of the CITF Secretariat, of:

Rhodes JS, Aumon A, Morin S, Girard M, Larochelle C, Lahav B, Brunet-Ratnasingham E, Pagliuzza A, Machitto L, Zhang W, Cutler A, Grand’Maison F, Zhou A, Finzi A, Chomont N, Kaufmann DE, Zandee S, Prat A, Wolf G, Moon KR. Gaining Biological Insights through Supervised Data Visualization. medRxiv. January 21, 2024; doi: https://doi.org/10.1101/2023.11.22.568384.

The results and/or conclusions contained in the research do not necessarily reflect the views of all CITF members.

A CITF-funded study, released as a preprint and not yet peer-reviewed, introduced a data visualization method called RF-PHATE, which generates low-dimensional visualizations that highlight relevant relationships in the data while disregarding extraneous factors. The researchers demonstrated their algorithm’s abilities through case studies of diesel exhaust-exposed lung cells, multiple sclerosis, and COVID-19. This study was led by Dr. Jake S. Rhodes (Brigham Young University) and Dr. Adrien Aumon (Quebec AI Institute and Université de Montréal) in collaboration with Dr. Daniel Kaufmann (Centre hospitalier de l’Université de Montréal).

RF-PHATE works by training a random forest, a machine-learning model that builds many decision trees, to predict the data labels from the features of the data. The relationships the forest learns are then extracted to create visualizations that reflect how data points relate to one another with respect to the features that matter for the labels, while ignoring irrelevant features.
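
The general idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes scikit-learn and NumPy are available, uses a synthetic dataset, measures similarity with the classic "shared leaf" random forest proximity, and replaces the PHATE embedding step with classical multidimensional scaling as a simple stand-in. The function name forest_proximities and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def forest_proximities(forest, X):
    """Fraction of trees in which each pair of samples lands in the same leaf."""
    leaves = forest.apply(X)  # shape (n_samples, n_trees): leaf index per tree
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    return same_leaf.mean(axis=2)


# Synthetic data (assumption): 5 informative features followed by 15 noise features.
X, y = make_classification(
    n_samples=120, n_features=20, n_informative=5, n_redundant=0,
    shuffle=False, random_state=0,
)

# Step 1: train a random forest to predict the labels from the features.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Step 2: extract what the forest learned as a sample-by-sample similarity matrix.
prox = forest_proximities(forest, X)

# Step 3 (stand-in for PHATE): classical multidimensional scaling on 1 - proximity.
sq_dist = (1.0 - prox) ** 2
n = sq_dist.shape[0]
centering = np.eye(n) - np.ones((n, n)) / n
gram = -0.5 * centering @ sq_dist @ centering
eigvals, eigvecs = np.linalg.eigh(gram)      # eigenvalues in ascending order
top = np.argsort(eigvals)[::-1][:2]          # indices of the two largest
embedding = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

print(embedding.shape)
```

Each row of embedding is a two-dimensional coordinate that could be passed to a scatter plot and coloured by label. Because the similarities come from a model trained on the labels, features the forest rarely uses have little influence on where points land.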

Key findings:

Using RF-PHATE, the researchers demonstrated the algorithm’s advantages in three case studies:

  • Aptitude with longitudinal datasets to identify subgroups of patients with multiple sclerosis (MS) while preserving the original structure;
  • Ability to create meaningful visualizations with inherently noisy data by identifying the impact of antioxidants on lung cells in the context of spectral data;
  • Capacity to enrich interpretability of data in a hierarchical manner as demonstrated by its ability to align known antibody profiles with clinically established patient outcome labels in COVID-19 cases.

Key benefits of RF-PHATE:

  • Combines the predictive power of random forests with strong visualization methods;
  • Is scalable to small and large datasets and works well for both continuous and categorical labels;
  • Is robust to noise due in part to the ability of random forests to determine feature importance;
  • Is a supervised method that can incorporate auxiliary information, such as expert-derived metadata or annotations, to provide valuable insights about relationships in the dataset (versus unsupervised methods, which preserve the dominant structure of the data and may include irrelevant features).
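
The point about noise robustness and feature importance can be illustrated with a short Python sketch. It is a toy example under stated assumptions, not an analysis from the study: the data are synthetic, scikit-learn's impurity-based feature_importances_ attribute is used as the importance measure, and the feature counts and sample size are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): with shuffle=False, the first 4 columns carry
# the signal and the remaining 16 columns are pure noise.
n_informative = 4
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=n_informative,
    n_redundant=0, shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = forest.feature_importances_
mean_informative = importances[:n_informative].mean()
mean_noise = importances[n_informative:].mean()

print(f"mean importance, informative features: {mean_informative:.3f}")
print(f"mean importance, noise features:       {mean_noise:.3f}")
```

In a setup like this, the forest assigns more importance on average to the informative features than to the noise features, which is the behaviour a forest-based visualization can draw on to downweight irrelevant measurements.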