One of the easiest ways to store data is to use text files (or comma separated files .csv
).
Nearly every tool can handle such data.
To read such data into R we use the function read.table
.
read.table
has a lot of arguments to control it's functionality:
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
The most useful ones are:
file
: character, the path to the file (either absolute or relative to the working directory [#! need link here
])header
: logical (eitherTRUE
orFALSE
). Does the file include column names in the first row?sep
: columns can be separated in text files by different characters (e.g.,
or;
or|
)dec
: the decimal separator. Default is.
, but in some countries (like Germany) this can also be,
stringsAsFactors
: Should be strings interpreted as factors? It tend to usestringsAsFactors = FALSE
(default isTRUE
) and convert only if needed.
So a typical command to load data would be:
setwd( "/home/edisz/Documents/Uni/Projects/qetxrbook")
df <- read.table('data/iris1.csv', sep = ';', header = TRUE, stringsAsFactors = FALSE)
After reading the data,
always check if all is correct.
You can use the str()
function for this:
str(df)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 NA 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : chr "setosa" "setosa" "setosa" "setosa" ...
Look at:
- correct number of rows (=obs.)
- correct number of columns (=variables)
- correct type of columns. Sometimes variable that should be numeric are recorded as factors. This is an indication that there is at least some non-numeric data in this column - check.
- correct column names
Writing text files is done via write.table
. It has a similar set of attributes as above.
However, to others should be mentioned:
append
: IfTRUE
the data is appended to the file. Otherwise, the existing file is overwritten.row.names
: Is always userow.names=FALSE
, because row-names should not contain information and the columns are better aligned.
So to write the above data.frame to a file use:
write.table(df, "/home/edisz/mytable.csv", sep = ';', row.names = FALSE)
Eventually you will have to work with Microsoft Excel files. Excel file are not a good format to store data and have been hard to read into R.
Luckily, there is now the readxl
package available on CRAN to make this easier.
R can also read a lot of other formats (SPSS, GIS-data, ...), however you'll certainly need additional packages. Just search for it on the internet.
R can connect to all major databases like MySQL, PostgreSQL or SQLite.
The usage is similar for all:
- Connect to the database
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = DBname, user = DBuser, host = DBhost, port = DBport)
where DBname
, DBuser
, DBhost
and DBport
are characters specifying the details of the connection.
- Query the data you need:
df <- dbGetQuery(con, 'SELECT * FROM public.df')
- Close the connection when finished:
dbDisconnect(con)
dbUnloadDriver(drv)
There are 3 additional data files in the data-directory (iris2.csv
, iris3.csv
and iris4.csv
). Each of this files has different 'problems'.
Read each into R using read.table()
and specify appropriate arguments.
It may be helpful to take a look at the files using a text-editor.
The data should have the same structure as above.
(Hint: read the help for read.table carefully).