Data Manipulation

How can I do basic data manipulation in Galaxy? Which tools are available to convert, reformat, filter, sort etc my text-based data?

In this tutorial, we will use as our dataset a table with results from the Olympics, from the games in Athens in 1896 until Tokyo in 2020.

The objective is to familiarize you with a large number of the most important data manipulation tools in Galaxy.

The dataset contains 234,522 rows and 17 columns. Each row corresponds to an individual athlete competing in an individual Olympic event. The columns are:

athlete_id - Unique number for each athlete
name - Athlete’s name
sex - M or F
birth_year - 4-digit number
birth_day - e.g. 24 July
birth_place - town and/or country
height - In centimeters (or NA if data not known)
weight - In kilograms (or NA if data not known)
team - Team name
noc - National Olympic Committee 3-letter code
games - Year and season
year - Integer
season - Summer or Winter
city - Host city
sport - Sport
event - Event
medal - Gold, Silver, Bronze (or NA if no medal was won)

Quick Start

Create a new History
Rename your history to be meaningful and easy to find.

Upload the dataset olympics.tsv

 https://zenodo.org/record/6803028/files/olympics.tsv

View the dataset
Rename the dataset to olympics

Question

What is the format of the file?
What does each row represent?
How many lines are in the file? (Line/Word/Character count Tool)
How many columns?

Go Up

File Format Conversion

Convert Tabular to CSV
View the dataset
Rename the dataset to olympics.csv.

Go Up

Sorting

We will sort the file in chronological order based on the year of the Olympic games (column ??)

Open the Sort - data in ascending or descending order Tool
- *Sort Dataset”: olympics.tsv
- on column: Column 12
- in: Ascending order
- *Flavor”: Numeric sort
Execute
View the sorted dataset

Go Up

Sort on multiple columns at once

Now, we want to sort twice, first by year, an then within each year, we sort again alphabetically by name.

Re-run the
All parameter from the first step should already be set for you, and should remain the same
Insert Column selections:
- on column: Column 2
- in: Ascending order
- Flavor: Alphabetical sort
Execute
View the sorted dataset

Go Up

Filtering

We will filter the file to show only winter Olympics

Look at the olympics.tsv file. Which column contains this information?
Open the tool Filter data on any column using simple expressions
- Filter: olympics.tsv
- With the following condition: c13==’Winter’
- Number of header lines to skip: 1
Execute
View the filtered dataset
Question
How many lines are in this file?

Re-run the step for the Summer Olympics

Rename both

Exercises

How many gold medals were handed out?
How many total medals?
How many medals were handed out during the 2018 Olympics?
How many medals were won by individuals with a height between 170 and 180 cm?
How many gold medals were won by individuals shorter than 160cm or taller than 190?

Solutions

8,110 (Expression: c17=='Gold')
24,633 (Expression: c17=='Gold' or c17=='Silver' or c17=='Bronze', or c17!='NA')
131 (Expression: c17=='Gold' and c12==2018 (note: do not use quotes around 2018, as it is a numerical value))
8,086 (Expression: c17!='NA' and c7>=170 and c7<=180)
812 (Expression: c17=='Gold' and (c7<160 or c7>190) (note: parentheses are important here))

Go Up

Counting

Count occurrences of values in columns

Let’s start by simply counting how many different Olympic Games we have in our dataset, and how many times it appears (so how many participations there were each year)

Count occurrences of each record with the following parameters:
- from dataset: olympics.tsv
- Count occurrences of values in column(s): Column 11
- How should the results be sorted?: most common values first
View the result

Question

How many different Olympic games are in our file?
Which Olympic games had the most participations?

Let’s try counting the number of men and women in each olympic games.

Count occurrences of each record with the following parameters:
- from dataset: olympics.tsv
- Count occurrences of values in column(s): Column 11, Column 3
- How should the results be sorted?: most common values first
View the result

Question

How many women were in the first Olympic games?
Which Olympic games had the most women participants?

Go Up

Counting number of different sports per Olympic Games

We will now determine how many different sport there were in each of the different Olympics

Datamash with the following parameters:
- Input tabular dataset: olympic.tsv
- Group by fields: 11 (game column)
- Input file has a header line: Yes
- Print header line: Yes
- Sort input: Yes
- Ignore case when grouping: Yes
- In Operation to perform on each group:
  - Type: Count Unique values
  - On column: c15 (sport column)
View the result
Rename the output to something descriptive

Question

How many sport were in the first Olympics? How many in the latest?
Which Olympics had the most different sports?

Exercises

Which country has had the most participations in the Olympics?
How many countries participated in the first Olympics? How many in the last?

Solutions

1 - Count Tool with the following parameters:

from dataset: olympics.tsv
Count occurrences of values in column(s): Column 9 (the team column)
How should the results be sorted?: With the most common values first

This gives an output like:

United States
France
Great Britain
Italy
Canada

2 - Datamash Tool with the following parameters:

Input tabular dataset: olympics.tsv
Group by fields: 11 (the games column)
Input file has a header line: yes
Sort Input: yes

Operation to perform on each group:

Type: Count Unique values
On Column: Column: 9 (the team column)

This gives an output like:

Summer Olympics	15
Summer Olympics	29
Summer Olympics	13
  ...
Summer Olympics	280
Winter Olympics	108
Summer Olympics	250

Go Up

Grouping

Often we may want to group rows based on a value in a column, and perform some operation on the resulting rows. For example we would like to group the olympics data by one value (e.g. year, country, sport), and determine some value for each group (e.g. number of medals won, average age of athletes, etc).

We can use the Datamash tool for this purpose.

Tallest athlete per sport

Datamash with the following parameters:
- Input tabular dataset: olympic.tsv
- Group by fields: 15
- Input file has a header line: Yes
- Print header line: Yes
- Sort input: Yes
- Ignore case when grouping: Yes
- Skip NA or NaN values”: Yes
- In Operation to perform on each group:
  - Type: Maximun
  - On column: c7
View the result

Question

How tall was the tallest athlete in basketball?
And what about karate?

Go Up

Grouping on multiple columns

The question we would like to answer here, is what is the average height for men and women per sport?

Datamash with the following parameters:
- Input tabular dataset: olympic.tsv
- Group by fields: 15,3 (sport and sex column)
- Input file has a header line: Yes
- Print header line: Yes
- Sort input: Yes
- Ignore case when grouping: Yes
- Skip NA or NaN values”: Yes
- In Operation to perform on each group:
  - Type: Mean
  - On column: c7
View the result

Exercises

How tall is the shortest woman Badminton player to win a gold medal?
What is the average height and standard deviation of athletes from Denmark (DEN) in the 1964 Olympics?

Solutions

Datamash Tool with the following parameters:
- Input tabular dataset: olympics.tsv
- Group by fields: 17,15,3 (The medal, sports, and the sex columns)
- Sort input: yes
- Input file has a header line: yes
- Print header line yes
- Skip NA or NaN values: yes
- Operation to perform on each group:
  - Type: Minimum
  - On Column: Column: 7
  This will give an output like below, scroll down to find the gold medalists, then badminton, then F
```
GroupBy(medal)	GroupBy(sport)	GroupBy(sex)	min(height)
Bronze	Alpine Skiing	F	156
Bronze	Alpine Skiing	M	167
Bronze	Archery	F	155
Bronze	Archery	M	166
Bronze	Art Competitions	F	-inf
Bronze	Art Competitions	M	172
Bronze	Artistic Gymnastics	F	136
...
```
Datamash Tool with the following parameters:
- Input tabular dataset: olympics.tsv
- Group by fields: 12,9 (year and team columns)
- Sort input: yes
- Input file has a header line: yes
- Print header line: yes
- Skip NA or NaN values: yes
- Operation to perform on each group:
  - Type: Mean
  - On Column: Column: 7
- Operation to perform on each group:
  - Type: Population Standard deviation
  - On Column: Column: 7
This will give an output like below:

GroupBy(year)	GroupBy(team)	mean(height)	pstdev(height)
Australia	nan	nan
Austria	nan	nan
Belgium	nan	nan
Bulgaria	nan	nan
Denmark	nan	nan
France	167.62962962963	4.6836844324555
...

Go Up

Computing

Sometimes we want to use the data in our column to compute a new value, and add that to the table. As an example, let’s calculate the age of each athlete at the time of participation, and add this as a new column to our dataset.

Compute on rows tool with the following parameters
- Input file: olympic.tsv
- Input has a header line with column names?: Yes
- In Expressions:
  - Add expression: int(c12)-int(c4) (year column minus the year_of_birth column)
  - Mode of the operation?: Append
  - The new column name: age
- Under Error handling:
  - Autodetect column types: No
  - If an expression cannot be computed for a row: Fill in a replacement value
  - Replacement value: NA (not available)
View the result
Rename the output to something descriptive

Question

How old was Arnaud Boetsch during his Olympic tennis participation?

Exercises

Calculate BMI (weight/height²) for all athletes and add it as a new column directly after the existing height and weight columns.(Remember that our height is in cm, and the formula expects height in meters)

Solutions

Compute on rows Tool with the following parameters:
- Input file: olympics.tsv
- Input has a header line with column names?: Yes
- In Expressions:
  - *Add expression: int(c8)/(int(c7)*int(c7))*10000
  - Mode of the operation?: Insert
  - Insert new column before existing column number: 9
  - The new column name: BMI
- Under Error handling:
- Autodetect column types: No
  - If an expression cannot be computed for a row: Fill in a replacement value
  - Replacement value: NA (not available)

Go Up

Find and Replace

Our file uses a mix of Athina and Athens to indicate the Capital City of Greece in the city column. Let’s standardize this by replacing occurrences of Athina with Athens.

Column Regex Find and Replace tool with the following parameters:
- Select cells from: olympics.tsv
- using column: Column 14
- Check:
  - Find Regex: Athina
  - Replacement: Athens
View the result

Go Up

Removing Columns

We can remove columns from a table using either Remove columns by heading if your table has a header line, or Cut columns from a table if it does not (in this case we just indicate columns by their number). These tools can also be used to change the order of columns in your file. There is also the tool Advanced Cut columns from a table that offers a few more options.

Suppose we want to simplify our file a bit. All we want is file with 4 columns: athlete name, sport, olympic games, and medals.

Remove columns by heading tool with the following parameters:
- Tabular File: olympics.tsv
- Header name: name
- Header name: sport
- Header name: games
- Header name: medals
- Keep named columns: Yes
View the result

Go Up

Unique

Let’s say we would like to create a list of all unique athletes (id and name).

Cut columns from a table tool using the following parameters:
- Cut Columns: c1,c2
- Delimited By: TAB
- From: olympics.tsv
View the result
Unique - occurrences of each record tool with the following parameters:
- File to scan for unique values*: output from the first step
View the result

Go Up

Concatenating

Our dataset was created in 2021, but since then we’ve had another Olympic event, the 2022 Winter Olympics in Beijing. If we have the same data for this latest Olympics, we could simply add the rows from the 2022 games to our current file with data, in order to create a single file with all data from 1896 to 2022.

Upload the dataset olympics.tsv

 https://zenodo.org/record/6803028/files/olympics_2022.tsv

Remove beginning of a file with the following parameters:
- Remove first: 1
- from: olympics_2022.tsv
Concatenate datasets tail-to-head with the following parameters:
- Concatenate Datasets: olympics.tsv
- Insert Dataset: output from previous step 1 (2022 data without the header)

Go Up

Training Course in Galaxy for Metagenomic Analysis on Targeted and Whole Genome Sequencing Data

Data Manipulation

Quick Start

File Format Conversion

Sorting

Sort on multiple columns at once

Filtering

Counting

Count occurrences of values in columns

Counting number of different sports per Olympic Games

Grouping

Tallest athlete per sport

Grouping on multiple columns

Computing

Find and Replace

Removing Columns

Unique

Concatenating