1 - Data Preparation
Marina Papadopoulou
Source:vignettes/step1_data_preparation.Rmd
1.1 Input data - trackdf
The swaRmverse package uses the trackdf package to standardize the input dataset. Data are expected to be
trajectories (id, x, y, t) generated by GPS or video tracking. First,
let's load some example data from trackdf:
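library(swaRmverse)
# Read an example video-tracking file shipped with trackdf
raw <- read.csv(system.file("extdata/video/01.csv", package = "trackdf"))
# Drop the rows flagged in the 'ignore' column
raw <- raw[!raw$ignore, ]
head(raw)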
## id x y size frame track ignore track_fixed
## 1 1 629.3839 882.4783 1154 1 1 FALSE 1
## 2 2 1056.1692 656.5207 1064 1 2 FALSE 2
## 3 3 508.0092 375.2451 1624 1 3 FALSE 3
## 4 4 1277.6466 373.7491 1443 1 4 FALSE 4
## 5 5 1379.2844 343.0853 1431 1 5 FALSE 5
## 6 6 1137.1378 174.5110 1321 1 6 FALSE 6
1.2 Transform data
trackdf takes as input a vector for each positional time
series (x, y), along with a vector of ids and a vector of times. Time is
converted to the POSIXct date-time format. Without additional information,
the package uses UTC as the time zone, the current time as the origin of the
experiment, and 1 second as the sampling step (time between
observations). If your t column corresponds to real time (and
not frames or sampling steps, e.g., c(1, 2, 3, 4)), then the
period doesn’t have to be specified. For more details, see https://swarm-lab.github.io/trackdf/index.html. For now,
let’s specify these attributes and create our main dataset (as a
data frame):
data_df <- set_data_format(raw_x = raw$x,
                           raw_y = raw$y,
                           raw_t = raw$frame,
                           raw_id = raw$track_fixed,
                           origin = "2020-02-1 12:00:21",
                           period = "0.04S",
                           tz = "America/New_York"
                           )
head(data_df)
## Track table [6 observations]
## Number of tracks: 6
## Dimensions: 2D
## Geographic: FALSE
## Table class: data frame ('data.frame')
## id t x y set
## 1 1 2020-02-01 12:00:21 629.3839 882.4783 2020-02-01
## 2 2 2020-02-01 12:00:21 1056.1692 656.5207 2020-02-01
## 3 3 2020-02-01 12:00:21 508.0092 375.2451 2020-02-01
## 4 4 2020-02-01 12:00:21 1277.6466 373.7491 2020-02-01
## 5 5 2020-02-01 12:00:21 1379.2844 343.0853 2020-02-01
## 6 6 2020-02-01 12:00:21 1137.1378 174.5110 2020-02-01
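To make the role of origin and period concrete, the following base R sketch converts frame numbers into timestamps. It assumes that each timestamp is the origin plus the frame number multiplied by the period; this illustrates the idea and is not necessarily what trackdf runs internally. The frame numbers are hypothetical.

```r
# Values taken from the call above
tz <- "America/New_York"
origin <- as.POSIXct("2020-02-01 12:00:21", tz = tz)
period_s <- 0.04  # seconds between observations (25 frames per second)

# Hypothetical frame numbers
frames <- c(1, 2, 25, 50)

# Assumed conversion: origin + frame number * period
timestamps <- origin + frames * period_s

# Print with fractional seconds so the sub-second steps are visible
format(timestamps, "%Y-%m-%d %H:%M:%OS2", tz = tz)
```

With a period of 0.04 seconds, 25 frames span one second, so the default one-line print of these timestamps (which shows whole seconds only) hides most of the variation between consecutive rows.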
Note that a ‘set’ column has been added to the dataset.
swaRmverse uses this column as the main unit for
grouping the tracks into separate events. By default, the day of data
collection is used.
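The default grouping can be pictured with a small base R sketch. It assumes the set label is simply the calendar date of each timestamp in the chosen time zone; the timestamps are hypothetical, and the package's own implementation may differ.

```r
tz <- "America/New_York"

# Hypothetical timestamps: two observations on one day, one on the next
t <- as.POSIXct(c("2020-02-01 12:00:21",
                  "2020-02-01 12:00:22",
                  "2020-02-02 09:30:00"), tz = tz)

# Assumed default: the set is the day of data collection
set <- as.character(as.Date(t, tz = tz))

# Row indices grouped by set: each day forms one event
split(seq_along(t), set)
```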
1.3 Multi-species or multi-context data
As mentioned above, swaRmverse uses the date as the
default unit of data organization. However, if several separate
observations are conducted on the same day, or an additional label on
the data is needed, such as context or species, additional information
can be given to the function. For instance, let’s assume that the dataset
contains data from 2 different contexts, stored in raw$context.
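A minimal sketch of how such a context column might be built, shown on a toy data frame so that it runs on its own. It assumes the first half of the rows belongs to one context and the second half to another; the label "ctx1" matches the output further down, while "ctx2" and the toy values are hypothetical.

```r
# Toy stand-in for the tracking data (hypothetical values)
toy <- data.frame(id = rep(1:2, times = 4),
                  frame = rep(1:4, each = 2))

# Label the first half of the rows "ctx1" and the rest "ctx2"
half <- ceiling(nrow(toy) / 2)
toy$context <- c(rep("ctx1", half), rep("ctx2", nrow(toy) - half))

# Count the rows per context
table(toy$context)
```

The same assignment applied to raw would produce the raw$context vector used in the call that follows.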
We can give any additional vector to the function and it will be combined with the date column as a set:
data_df <- set_data_format(raw_x = raw$x,
                           raw_y = raw$y,
                           raw_t = raw$frame,
                           raw_id = raw$track_fixed,
                           origin = "2020-02-1 12:00:21",
                           period = "0.04 seconds",
                           tz = "America/New_York",
                           raw_context = raw$context
                           )
head(data_df)
## Track table [6 observations]
## Number of tracks: 6
## Dimensions: 2D
## Geographic: FALSE
## Table class: data frame ('data.frame')
## id t x y set
## 1 1 2020-02-01 12:00:21 629.3839 882.4783 2020-02-01_ctx1
## 2 2 2020-02-01 12:00:21 1056.1692 656.5207 2020-02-01_ctx1
## 3 3 2020-02-01 12:00:21 508.0092 375.2451 2020-02-01_ctx1
## 4 4 2020-02-01 12:00:21 1277.6466 373.7491 2020-02-01_ctx1
## 5 5 2020-02-01 12:00:21 1379.2844 343.0853 2020-02-01_ctx1
## 6 6 2020-02-01 12:00:21 1137.1378 174.5110 2020-02-01_ctx1
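The set labels in this output have the form date_context. A base R sketch that reproduces this format, assuming the date and the extra label are joined with an underscore as the output suggests; the timestamps and the "ctx2" label are hypothetical, and this is not necessarily how swaRmverse builds the column internally.

```r
tz <- "America/New_York"

# Hypothetical timestamps and context labels
t <- as.POSIXct(c("2020-02-01 12:00:21",
                  "2020-02-01 12:00:21",
                  "2020-02-01 15:00:00"), tz = tz)
context <- c("ctx1", "ctx1", "ctx2")

# Join the calendar date and the context label with an underscore
set <- paste(as.Date(t, tz = tz), context, sep = "_")

# Two contexts on the same day give two distinct sets
unique(set)
```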
With this dataset, we can move on to analyzing the collective motion in the data.