Exploring datasets with statistics and scatterplots

Often, one or more data pre-processing steps are required before the analysis can proceed. After the analysis, you may also want to visualize the results or the datasets themselves.


The Iris flower data set, also known as Fisher’s or Anderson’s Iris data set, is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in his 1936 paper (Fisher 1936). Each row of the table represents an iris flower sample, describing its species and the dimensions in centimeters of its botanical parts, the sepals and petals.


Quick Start

  1. Create a new History
  2. Rename your history to be meaningful and easy to find.
  3. Upload the Iris dataset iris.csv
     https://zenodo.org/record/1319069/files/iris.csv 
    
  4. View the dataset
  5. Rename the dataset to iris
  6. Check the datatype and change it to csv if it is anything else.
  7. Add an #iris tag to the dataset



Pre-Processing

The tools we will use require tab-separated input data and assume there is no header line. Since our data is comma-separated and has a header line, we will have to perform the following pre-processing steps to prepare it for the actual analysis:

  • Format conversion
  • Header removal

Format conversion

  1. Convert the CSV file (comma-separated values) to tabular format (tsv; tab-separated values)
    • In the central panel, click on the Convert tab at the top
    • In the upper part, under Convert, select tabular (using csv-to-tabular)
    • Click the Create dataset button to start the conversion.
  2. Rename the resulting dataset to iris tabular
  3. View the generated file
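
For readers who want to see what the conversion does to the data, here is a minimal Python sketch of the same idea. It is not the Galaxy converter itself: the function name `csv_to_tabular`, the header names, and the two sample rows are assumptions made for illustration only.

```python
import csv
import io


def csv_to_tabular(csv_text):
    """Re-emit comma-separated text as tab-separated text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Illustrative input: an assumed header line plus two rows, not the full file.
SAMPLE_CSV = (
    "Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,Species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
)

tabular = csv_to_tabular(SAMPLE_CSV)
print(tabular)
```

Using the `csv` module rather than a plain string replace keeps quoted fields that contain commas intact, should a dataset have any.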

Header removal

  1. Use the tools search box at the top of the tool panel to find Remove beginning
  2. Remove beginning with the following parameters:
    • Remove first: 1 (to remove the first line only)
    • from: select the iris tabular file from your history
    • Execute
  3. Rename the resulting dataset to iris clean
  4. View the generated file
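
The header removal step amounts to dropping the first line of the file. A small Python sketch of that operation follows; `remove_beginning` is a hypothetical helper named after the Galaxy tool, and the sample text (including its header names) is illustrative rather than the real file.

```python
def remove_beginning(text, n=1):
    """Return the text without its first n lines."""
    lines = text.splitlines(keepends=True)
    return "".join(lines[n:])


# Illustrative tab-separated input with an assumed header line.
IRIS_TABULAR = (
    "Sepal_Length\tSepal_Width\tPetal_Length\tPetal_Width\tSpecies\n"
    "5.1\t3.5\t1.4\t0.2\tsetosa\n"
    "7.0\t3.2\t4.7\t1.4\tversicolor\n"
)

iris_clean = remove_beginning(IRIS_TABULAR, n=1)

# Counting the remaining lines gives the number of samples in this sample text.
line_count = len(iris_clean.splitlines())
print(iris_clean)
print("lines:", line_count)
```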

Question

  • How many samples (lines) does our dataset contain?



Data Analysis: What does the dataset contain?


How many different species are in the dataset?

In order to answer this question, we will have to look at column 5 of our file, and count how many different values (species) appear there. There are several ways we could do this in Galaxy.

Extract species

One approach might be to first extract this column from the file, and then count how many unique lines the file contains.

  1. Cut columns from a table with the following parameters:
    • Cut columns: c5
    • Delimited by: tab
    • From: select the iris clean file from your history
  2. Rename the resulting dataset to iris species column
  3. View the generated file
  4. Unique occurrences of each record with the following parameters:
    • File to scan for unique values: select the iris species column file from your history
  5. Rename the resulting dataset to iris species
  6. View the generated file
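
The two-step approach above (cut one column, then keep the unique values) can be sketched in Python as follows. The helper names `cut_column` and `unique_values` are hypothetical, the six rows are a small illustrative subset, and sorting the result is a choice made here for reproducible output.

```python
def cut_column(tabular_text, column):
    """Extract one column (1-based, like c5 in the Cut tool) from tab-separated text."""
    return [
        line.split("\t")[column - 1]
        for line in tabular_text.splitlines()
        if line
    ]


def unique_values(values):
    """Return each distinct value once, sorted for a stable order."""
    return sorted(set(values))


# Illustrative header-free sample, not the full dataset.
IRIS_CLEAN = (
    "5.1\t3.5\t1.4\t0.2\tsetosa\n"
    "4.9\t3.0\t1.4\t0.2\tsetosa\n"
    "7.0\t3.2\t4.7\t1.4\tversicolor\n"
    "6.4\t3.2\t4.5\t1.5\tversicolor\n"
    "6.3\t3.3\t6.0\t2.5\tvirginica\n"
    "5.8\t2.7\t5.1\t1.9\tvirginica\n"
)

species_column = cut_column(IRIS_CLEAN, 5)
species = unique_values(species_column)
print(species)
```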

Question

  1. How many different species are in the dataset?
  2. What are the different Iris species?

Grouping dataset

  1. Try answering this question (how many Iris species are in the file?) again using a different approach, for example the Group tool with column 5 as the grouping column.
  2. Did you get the same answer as before?
  3. Rename the resulting dataset to iris species group



How many samples by species are in the dataset?

Now that we know that there are 3 different species in our dataset, our next objective is determining how many samples of each species we have. To answer this, we need to look at column 5 again, but instead of just determining how many unique values there are, we need to count how many times each of them occurs.

  1. Re-run the Group tool with the following parameters:
    • Select data: iris clean
    • Group by column: Column: 5
    • Insert operation:
      • Type: Count
      • On Column: 1
  2. Rename the resulting dataset to iris samples per species group
  3. View the generated file
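
Grouping by a column and counting rows per group can be sketched in Python with `collections.Counter`. This is an illustration of the operation, not the Group tool's implementation; `count_per_group` is a hypothetical helper, and the counts it prints reflect only the few illustrative rows below, not the real dataset.

```python
from collections import Counter


def count_per_group(tabular_text, group_column):
    """Count how many rows share each value of the given 1-based column."""
    keys = (
        line.split("\t")[group_column - 1]
        for line in tabular_text.splitlines()
        if line
    )
    return Counter(keys)


# Illustrative sample with deliberately uneven group sizes.
IRIS_SAMPLE = (
    "5.1\t3.5\t1.4\t0.2\tsetosa\n"
    "4.9\t3.0\t1.4\t0.2\tsetosa\n"
    "7.0\t3.2\t4.7\t1.4\tversicolor\n"
    "6.3\t3.3\t6.0\t2.5\tvirginica\n"
    "5.8\t2.7\t5.1\t1.9\tvirginica\n"
    "7.1\t3.0\t5.9\t2.1\tvirginica\n"
)

counts = count_per_group(IRIS_SAMPLE, 5)
for name, n in sorted(counts.items()):
    print(name, n, sep="\t")
```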

Question

  • How many samples per species are in the dataset?

Analysis: How to differentiate the different Iris species?

We know that we have 3 species of iris flowers, with 50 samples for each:

  • setosa
  • versicolor
  • virginica

Our aim is to find out whether the features we have been given for each species can help us to highlight the differences between the 3 species.

In our dataset, we have the following features measured for each sample:

  • Petal length
  • Petal width
  • Sepal length
  • Sepal width



Generate summary and descriptive statistics

Get the mean and sample standard deviation of Iris flower features

  1. Datamash tool with the following parameters:
    • Input tabular dataset: iris tabular
    • Group by fields: 5
    • Input file has a header line: Yes
    • Print header line: Yes
    • Sort input: Yes
    • Print all fields from input file: No
    • Ignore case when grouping: Yes
    • In Operation to perform on each group:
      • Insert Operation to perform on each group
        • Type: Mean
        • On column: c1
      • Insert Operation to perform on each group
        • Type: Sample Standard deviation
        • On column: c1
      • Insert Operation to perform on each group
        • Type: Mean
        • On column: c2
      • Insert Operation to perform on each group
        • Type: Sample Standard deviation
        • On column: c2
      • Insert Operation to perform on each group
        • Type: Mean
        • On column: c3
      • Insert Operation to perform on each group
        • Type: Sample Standard deviation
        • On column: c3
      • Insert Operation to perform on each group
        • Type: Mean
        • On column: c4
      • Insert Operation to perform on each group
        • Type: Sample Standard deviation
        • On column: c4
  2. Rename the resulting dataset to iris summary and statistics
  3. View the generated file
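
To make the Datamash settings above more concrete, here is a Python sketch that computes the per-group mean and sample standard deviation with the standard `statistics` module. It is not Datamash itself: `summarize` is a hypothetical helper, the header names are assumed, and the input is a tiny illustrative subset, so the printed numbers are not the statistics of the full dataset.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize(tabular_text, group_column, value_columns):
    """Mean and sample standard deviation per group for 1-based columns."""
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    header = next(reader)  # the input has a header line
    groups = defaultdict(lambda: defaultdict(list))
    for row in reader:
        if not row:
            continue
        key = row[group_column - 1].lower()  # ignore case when grouping
        for col in value_columns:
            groups[key][col].append(float(row[col - 1]))
    summary = {}
    for key in sorted(groups):  # sorted output, one entry per group
        summary[key] = {
            # statistics.stdev is the sample (n - 1) standard deviation
            # and needs at least two values per group.
            header[col - 1]: (statistics.mean(vals), statistics.stdev(vals))
            for col, vals in groups[key].items()
        }
    return summary


# Illustrative sample with an assumed header line.
IRIS_TABULAR = (
    "Sepal_Length\tSepal_Width\tPetal_Length\tPetal_Width\tSpecies\n"
    "5.1\t3.5\t1.4\t0.2\tsetosa\n"
    "4.9\t3.0\t1.4\t0.2\tsetosa\n"
    "7.0\t3.2\t4.7\t1.4\tversicolor\n"
    "6.4\t3.2\t4.5\t1.5\tversicolor\n"
    "6.3\t3.3\t6.0\t2.5\tvirginica\n"
    "5.8\t2.7\t5.1\t1.9\tvirginica\n"
)

summary = summarize(IRIS_TABULAR, group_column=5, value_columns=[1, 2, 3, 4])
for species_name, stats in summary.items():
    for feature, (mean, sd) in stats.items():
        print(f"{species_name}\t{feature}\tmean={mean:.3f}\tsd={sd:.3f}")
```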

Question

  • Can we differentiate the different Iris flower species?



Visualize Iris dataset features with two-dimensional scatterplots

  1. Scatterplot w ggplot2 with the following parameters:
    • Input tabular dataset: iris clean
    • Column to plot on x-axis: 1
    • Column to plot on y-axis: 2
    • Plot title: Sepal width as a function of sepal length
    • Label for x axis: Sepal length
    • Label for y axis: Sepal width
    • In Advanced Options:
      • Data point options: User defined point options
        • relative size of points: 2.0
      • Plotting multiple groups: Plot multiple groups of data on one plot
        • column differentiating the different groups: 5
        • Color schemes to differentiate your groups: Set 2 - predefined color pallete
    • In Output Options:
      • Additional output format: PDF
  2. View the generated plot

  3. Rename the resulting dataset to iris sepal scatterplot
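
A grouped scatterplot like the one configured above can be approximated outside Galaxy. The sketch below uses matplotlib rather than ggplot2, so it is a stand-in for the tool and not its implementation; `group_points` and `save_scatterplot` are hypothetical helpers, and the rows are a small illustrative subset. matplotlib is imported only inside the plotting function, so the grouping part runs without it.

```python
from collections import defaultdict


def group_points(tabular_text, x_column, y_column, group_column):
    """Collect (xs, ys) per group from header-free tab-separated text."""
    groups = defaultdict(lambda: ([], []))
    for line in tabular_text.splitlines():
        if not line:
            continue
        fields = line.split("\t")
        xs, ys = groups[fields[group_column - 1]]
        xs.append(float(fields[x_column - 1]))
        ys.append(float(fields[y_column - 1]))
    return dict(groups)


def save_scatterplot(groups, path, title, x_label, y_label):
    """Draw one colored point series per group and save the figure."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, (xs, ys) in sorted(groups.items()):
        ax.scatter(xs, ys, label=name)
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.legend(title="Species")
    fig.savefig(path)
    plt.close(fig)


# Illustrative header-free sample, not the full dataset.
IRIS_CLEAN = (
    "5.1\t3.5\t1.4\t0.2\tsetosa\n"
    "4.9\t3.0\t1.4\t0.2\tsetosa\n"
    "7.0\t3.2\t4.7\t1.4\tversicolor\n"
    "6.3\t3.3\t6.0\t2.5\tvirginica\n"
)

groups = group_points(IRIS_CLEAN, x_column=1, y_column=2, group_column=5)
print(sorted(groups))
```

With matplotlib installed, a call such as `save_scatterplot(groups, "iris_sepal_scatterplot.png", "Sepal width vs. sepal length", "Sepal length", "Sepal width")` would write the figure to disk; the file name here is only an example.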

Question

  1. What does this scatter plot tell us about Iris species?
  2. Make a new scatter plot, this time with Petal length versus Petal width.
  3. Can we differentiate between the three Iris species?



Galaxy Management


Convert your analysis history into a workflow

  1. Follow the instructions from the previous lecture
  2. Change the workflow name to something more descriptive, for example: Exploring Iris dataset with statistics and scatterplots.
  3. Click on Workflow in the top menu of Galaxy.

The workflow editor

  1. Open the workflow editor
    • Click on the dropdown menu to the right of your workflow name.
  2. Select Edit to launch the workflow editor.
    • When you click on a workflow step, you will get a view of all the parameter settings for that tool on the right-hand side of your screen (the Details section)
    • You can also change the parameter settings of your workflow here, and also do more advanced configuration.
  3. Hiding intermediate outputs
    • By default, all outputs will be shown
    • Click the checkbox next to the outputs to mark them as important:
      • outfile in Unique step
      • out_file1 in Group step
      • png in both Scatterplot w ggplot2 steps
    • Now, when we run the workflow, we will only see these final outputs
  4. Renaming output datasets
    • Let’s rename the outputs we marked as important with the checkbox (and thus do not hide) to more meaningful names:
      • Unique tool, output outfile: rename to categories
      • Group tool, output out_file1: rename to samples per category
      • Rename the scatterplot outputs as well. Remember to choose generic names, since the workflow can now also be run on data other than iris plants.
  5. Save your workflow (important!) by clicking on the icon at the top right of the screen.
  6. Return to the analysis view by clicking on the Home icon (or Analyze Data on older Galaxy versions) at the top menu bar.



Run workflow on different data

Now that we have built our workflow, let’s use it on some different data. For example, let us explore the diamonds dataset. The original dataset consists of 53940 specimens of diamonds, for which it lists the prices and various properties. For this training, we have created a simpler dataset from the original, in which only the five columns relating to the price and the so-called 4 Cs (carat, cut, color and clarity) of diamond characteristics have been retained.

  • Carat refers to the weight of the diamond when measured on a scale
  • Cut refers to the quality of the cut and can take the grades Fair, Good, Very Good, Premium and Ideal
  • Color describes the overall tint, or lack thereof, of the diamond from colorless/white to yellow and is given on a letter scale ranging from D to Z (D being the best, known as colorless).
  • Clarity describes the amount and location of naturally occurring “inclusions” found in nearly all diamonds on a scale of eleven grades ranging from Flawless (the ideal situation) to I3 (Included level 3, the worst quality).

Starter Pack

  1. Create a new History
  2. Rename your history to be meaningful and easy to find.
  3. Upload the diamonds dataset diamonds.csv
     https://zenodo.org/record/3540705/files/diamonds.csv
    
  4. View the dataset
  5. Rename the dataset to diamond
  6. Add a #diamond tag to the dataset

Run workflow

  1. Open the workflow menu (top menu bar), find the workflow you made in the previous section, and select the option Run.
  2. Select the diamonds dataset as the input dataset.
  3. Customize the first scatter plot:
  4. Customize the second scatter plot.
  5. Click Run workflow.
  6. Once the workflow has started, you will initially be able to see all its steps, but the unimportant intermediates will disappear after they complete successfully.
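
The point of a workflow is that the same steps run unchanged on new input. As a rough analogy in Python, the sketch below wraps the "categories" and "samples per category" steps in one function that takes the category column as a parameter. Everything specific here is an assumption: the helper name `explore_categories`, the column names and their order, and the four sample rows, which stand in for the real file. Unlike the Galaxy steps, it selects the column by header name rather than by position.

```python
import csv
import io
from collections import Counter


def explore_categories(csv_text, category_field):
    """Return the sorted categories and the number of rows in each."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[category_field] for row in reader)
    return sorted(counts), dict(counts)


# Illustrative sample; the column names and order are assumed, not taken from the file.
DIAMONDS_SAMPLE = (
    "carat,cut,color,clarity,price\n"
    "0.23,Ideal,E,SI2,326\n"
    "0.21,Premium,E,SI1,326\n"
    "0.23,Good,E,VS1,327\n"
    "0.29,Premium,I,VS2,334\n"
)

categories, samples_per_category = explore_categories(DIAMONDS_SAMPLE, "cut")
print(categories)
print(samples_per_category)
```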

Question

  1. How many cut categories are there in the diamonds dataset?
  2. How many samples are there in each cut category?
  3. What do you notice about the relationship between price and carat?
  4. Based on the plot showing Price vs. Carat with Clarity as a factor, do you think clarity accounts for some of the variance in price? Why?
