How can I do basic data manipulation in Galaxy? Which tools are available to convert, reformat, filter, sort etc my text-based data?
In this tutorial, we will use as our dataset a table with results from the Olympics, from the games in Athens in 1896 until Tokyo in 2020.
The objective is to familiarize you with a large number of the most important data manipulation tools in Galaxy.
The dataset contains 234,522 rows and 17 columns. Each row corresponds to an individual athlete competing in an individual Olympic event. The columns are:
olympics.tsv
https://zenodo.org/record/6803028/files/olympics.tsv
olympics
Question
Line/Word/Character count
Tool)olympics.csv
.We will sort the file in chronological order based on the year of the Olympic games (column ??)
Sort - data in ascending or descending order
Tool
Now, we want to sort twice, first by year, an then within each year, we sort again alphabetically by name.
We will filter the file to show only winter Olympics
olympics.tsv
file. Which column contains this information?Filter data on any column using simple expressions
View the filtered dataset
Rename both
Exercises
Solutions
c17=='Gold'
)c17=='Gold' or c17=='Silver' or c17=='Bronze'
, or c17!='NA'
)c17=='Gold' and c12==2018
(note: do not use quotes around 2018, as it is a numerical value))c17!='NA' and c7>=170 and c7<=180
)c17=='Gold' and (c7<160 or c7>190)
(note: parentheses are important here))Let’s start by simply counting how many different Olympic Games we have in our dataset, and how many times it appears (so how many participations there were each year)
Count occurrences of each record
with the following parameters:
Question
Let’s try counting the number of men and women in each olympic games.
Count occurrences of each record
with the following parameters:
Question
We will now determine how many different sport there were in each of the different Olympics
Datamash
with the following parameters:
olympic.tsv
Question
Exercises
Solutions
1 - Count
Tool with the following parameters:
This gives an output like:
17286 United States
11700 France
10230 Great Britain
8898 Italy
7988 Canada
2 - Datamash
Tool with the following parameters:
This gives an output like:
1896 Summer Olympics 15
1900 Summer Olympics 29
1904 Summer Olympics 13
...
2016 Summer Olympics 280
2018 Winter Olympics 108
2020 Summer Olympics 250
Often we may want to group rows based on a value in a column, and perform some operation on the resulting rows. For example we would like to group the olympics data by one value (e.g. year, country, sport), and determine some value for each group (e.g. number of medals won, average age of athletes, etc).
We can use the Datamash
tool for this purpose.
Datamash
with the following parameters:
olympic.tsv
Question
The question we would like to answer here, is what is the average height for men and women per sport?
Datamash
with the following parameters:
olympic.tsv
Exercises
Solutions
Datamash
Tool with the following parameters:
This will give an output like below, scroll down to find the gold medalists, then badminton, then F
GroupBy(medal) GroupBy(sport) GroupBy(sex) min(height)
Bronze Alpine Skiing F 156
Bronze Alpine Skiing M 167
Bronze Archery F 155
Bronze Archery M 166
Bronze Art Competitions F -inf
Bronze Art Competitions M 172
Bronze Artistic Gymnastics F 136
...
Datamash
Tool with the following parameters:
This will give an output like below:
GroupBy(year) GroupBy(team) mean(height) pstdev(height)
1896 Australia nan nan
1896 Austria nan nan
1896 Belgium nan nan
1896 Bulgaria nan nan
1896 Denmark nan nan
1896 France 167.62962962963 4.6836844324555
...
Sometimes we want to use the data in our column to compute a new value, and add that to the table. As an example, let’s calculate the age of each athlete at the time of participation, and add this as a new column to our dataset.
Compute on rows
tool with the following parameters
olympic.tsv
Question
Exercises
Solutions
Compute on rows
Tool with the following parameters:
int(c8)/(int(c7)*int(c7))*10000
Our file uses a mix of Athina and Athens to indicate the Capital City of Greece in the city column. Let’s standardize this by replacing occurrences of Athina with Athens.
Column Regex Find and Replace
tool with the following parameters:
We can remove columns from a table using either Remove columns by heading
if your table has a header line,
or Cut columns from a table
if it does not (in this case we just indicate columns by their number).
These tools can also be used to change the order of columns in your file.
There is also the tool Advanced Cut columns from a table
that offers a few more options.
Suppose we want to simplify our file a bit. All we want is file with 4 columns: athlete name, sport, olympic games, and medals.
Remove columns by heading
tool with the following parameters:
Let’s say we would like to create a list of all unique athletes (id and name).
Cut columns from a table
tool using the following parameters:
Unique - occurrences of each record
tool with the following parameters:
Our dataset was created in 2021, but since then we’ve had another Olympic event, the 2022 Winter Olympics in Beijing. If we have the same data for this latest Olympics, we could simply add the rows from the 2022 games to our current file with data, in order to create a single file with all data from 1896 to 2022.
olympics.tsv
https://zenodo.org/record/6803028/files/olympics_2022.tsv
Remove beginning of a file
with the following parameters:
Concatenate datasets tail-to-head
with the following parameters: