# IPSUR: Introduction to Probability and Statistics\\ Using R
# Copyright (C) 2014 G. Jay Kerns
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#+TITLE: Introduction to Probability and Statistics Using R
#+AUTHOR: G. Jay Kerns
#+EMAIL: [email protected]
#+LANGUAGE: en
#+OPTIONS: ':nil *:t -:t ::t <:t H:5 \n:nil ^:{} arch:headline
#+OPTIONS: author:t c:nil creator:comment d:nil date:t e:t email:nil
#+OPTIONS: f:nil inline:t num:t p:nil pri:nil stat:t tags:nil
#+OPTIONS: tasks:t tex:t timestamp:t toc:nil todo:t |:t
#+SELECT_TAGS:
#+PROPERTY: session *R*
#+PROPERTY: exports results
#+PROPERTY: results value raw
#+PROPERTY: cache no
#+LaTeX_CLASS: scrbook
#+LaTeX_CLASS_OPTIONS: [captions=tableheading]
#+LaTeX_CLASS_OPTIONS: [10pt,english,twoside]
#+LaTeX_HEADER: \input{include/preamble}
#+LATEX: \input{include/frontmatter}
#+LATEX: \input{include/preface-second}
#+CREATOR: Emacs 24.3.1 (Org mode 8.0.7)
* An Introduction to Probability and Statistics :introps:
:PROPERTIES:
:tangle: R/01-introps.R
:END:
#+LaTeX: \pagenumbering{arabic}
#+LaTeX: \noindent
This chapter has proved to be the hardest to write, by far. The
trouble is that there is so much to say -- and so many people have
already said it so much better than I could. When I get something I
like I will release it here.
In the meantime, there is a lot of information already available to a
person with an Internet connection. I recommend starting at Wikipedia,
which is not a flawless resource but covers the main ideas and links
to reputable sources.
In my lectures I usually tell stories about Fisher, Galton, Gauss,
Laplace, Quetelet, and the Chevalier de Mere.
** Probability
The common folklore is that probability has been around for millennia
but did not gain the attention of mathematicians until approximately
1654 when the Chevalier de Mere had a question regarding the fair
division of a game's payoff to the two players, supposing the game had
to end prematurely.
** Statistics
Statistics concerns data: their collection, analysis, and
interpretation. In this book we distinguish between two types of
statistics: descriptive and inferential.
Descriptive statistics concerns the summarization of data. We have a
data set and we would like to describe the data set in multiple
ways. Usually this entails calculating numbers from the data, called
descriptive measures, such as percentages, sums, averages, and so
forth.
Inferential statistics does more. There is an inference associated
with the data set, a conclusion drawn about the population from which
the data originated.
I would like to mention that there are two schools of thought in
statistics: frequentist and Bayesian. The difference between the
schools is related to how the two groups interpret the underlying
probability (see Section [[#sec-Interpreting-Probabilities]]). The frequentist
school gained a lot of ground among statisticians due in large part to
the work of Fisher, Neyman, and Pearson in the early twentieth
century. That dominance lasted until inexpensive computing power
became widely available; nowadays the Bayesian school is garnering
more attention and at an increasing rate.

This book is devoted mostly to the frequentist viewpoint because that
is how I was trained, with the conspicuous exception of Sections
[[#sec-Bayes-Rule]] and [[#sec-Conditional-Distributions]]. I plan to add more
Bayesian material in later editions of this book.
#+LaTeX: \newpage{}
** Exercises
#+LaTeX: \setcounter{thm}{0}
#+INCLUDE: "~/git/IPSUR/include/prelim.R" src R
* An Introduction to R :introR:
:PROPERTIES:
:tangle: R/02-introR.R
:CUSTOM_ID: cha-introduction-to-R
:END:
#+BEGIN_SRC R :exports none :eval never
# IPSUR: Introduction to Probability and Statistics Using R
# Copyright (C) 2014 G. Jay Kerns
#
# Chapter: An Introduction to R
#
# This file is part of IPSUR.
#
# IPSUR is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# IPSUR is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with IPSUR. If not, see <http://www.gnu.org/licenses/>.
#+END_SRC
Every \(\mathsf{R}\) book I have ever seen has had a section/chapter
that is an introduction to \(\mathsf{R}\), and so does this one. The
goal of this chapter is for a person to get up and running, ready for
the material that follows. See Section [[#sec-External-Resources]] for links
to other material which the reader may find useful.
*What do I want them to know?*
- Where to find \(\mathsf{R}\) to install on a home computer, and a
few comments to help with the usual hiccups that occur when
installing something.
- Abbreviated remarks about the available options to interact with
\(\mathsf{R}\).
- Basic operations (arithmetic, entering data, vectors) at the command
prompt.
- How and where to find help when they get in trouble.
- Other little shortcuts I am usually asked about when introducing
\(\mathsf{R}\).
** Downloading and Installing \(\mathsf{R}\)
:PROPERTIES:
:CUSTOM_ID: sec-download-install-R
:END:
The instructions for obtaining \(\mathsf{R}\) largely depend on the
user's hardware and operating system. The \(\mathsf{R}\) Project has
written an \(\mathsf{R}\) Installation and Administration manual with
complete, precise instructions about what to do, together with all
sorts of additional information. The following is just a primer to get
a person started.
*** Installing \(\mathsf{R}\)
Visit one of the links below to download the latest version of \(\mathsf{R}\)
for your operating system:
- Microsoft Windows: :: http://cran.r-project.org/bin/windows/base/
- MacOS: :: http://cran.r-project.org/bin/macosx/
- Linux: :: http://cran.r-project.org/bin/linux/
On Microsoft Windows, click the =R-x.y.z.exe= installer to start
installation. When it asks for "Customized startup options", specify
=Yes=. In the next window, be sure to select the SDI (single document
interface) option; this is useful later when we discuss
three-dimensional plots with the =rgl= package \cite{rgl}.
**** Installing \(\mathsf{R}\) on a USB drive (Windows)
With this option you can use \(\mathsf{R}\) portably and without
administrative privileges. There is an entry in the \(\mathsf{R}\) for
Windows FAQ about this. Here is the procedure I use:
1. Download the Windows installer above and start installation as
usual. When it asks /where/ to install, navigate to the top-level
directory of the USB drive instead of the default =C= drive.
2. When it asks whether to modify the Windows registry, uncheck the
box; we do NOT want to tamper with the registry.
3. After installation, change the name of the folder from =R-x.y.z= to
just plain \(\mathsf{R}\). (Even quicker: do this in step 1.)
4. [[http://ipsur.r-forge.r-project.org/book/download/R.exe][Download this shortcut]] and move it to the top-level directory of
the USB drive, right beside the \(\mathsf{R}\) folder, not inside
the folder. Use the downloaded shortcut to run \(\mathsf{R}\).
Steps 3 and 4 are not required but save you the trouble of navigating
to the =R-x.y.z/bin= directory to double-click =Rgui.exe= every time
you want to run the program. It is useless to create your own shortcut
to =Rgui.exe=. Windows does not allow shortcuts to have relative
paths; they always have a drive letter associated with them. So if you
make your own shortcut and plug your USB drive into some /other/
machine that happens to assign your drive a different letter, then
your shortcut will no longer be pointing to the right place.
*** Installing and Loading Add-on Packages
:PROPERTIES:
:CUSTOM_ID: sub-installing-loading-packages
:END:
There are /base/ packages (which come with \(\mathsf{R}\)
automatically), and /contributed/ packages (which must be downloaded
for installation). For example, on the version of \(\mathsf{R}\) being
used for this document the default base packages loaded at startup are
#+BEGIN_SRC R :exports both :results output pp
getOption("defaultPackages")
#+END_SRC
#+RESULTS:
: [1] "datasets" "utils" "grDevices" "graphics" "stats" "methods"
The base packages are maintained by a select group of volunteers,
called \(\mathsf{R}\) Core. In addition to the base packages, there
are literally thousands of additional contributed packages written by
individuals all over the world. These are stored worldwide on mirrors
of the Comprehensive \(\mathsf{R}\) Archive Network, or =CRAN= for
short. Given an active Internet connection, anybody is free to
download and install these packages and even inspect the source code.
To install a package named =foo=, open up \(\mathsf{R}\) and type
=install.packages("foo")=
@@latex:\index{install.packages@\texttt{install.packages}}@@. To
install =foo= and additionally install all of the other packages on
which =foo= depends, instead type
=install.packages("foo", dependencies = TRUE)=.
The general command =install.packages()= will (on most operating
systems) open a window containing a huge list of available packages;
simply choose one or more to install.
No matter how many packages are installed onto the system, each one
must first be loaded for use with the
=library= @@latex:\index{library@\texttt{library}}@@ function. For instance, the
=foreign= package \cite{foreign} contains all sorts of functions
needed to import data sets into \(\mathsf{R}\) from other software
such as SPSS, SAS, /etc/. But none of those functions will be
available until the command =library("foreign")= is issued.
Type =library()= at the command prompt (described below) to see a list
of all available packages in your library.
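
The install-then-load sequence described above can be sketched as
follows. This is a minimal illustration, not a prescribed workflow: it
uses the =foreign= package mentioned earlier as the example, and the
names =pkg= and =pkg_loaded= are ours. The block is not evaluated
during export because installation needs an active Internet
connection.
#+BEGIN_SRC R :exports code :eval never
pkg <- "foreign"                      # example package; any CRAN name works
# install only if the package is not already present in the library
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg, dependencies = TRUE)
}
library(pkg, character.only = TRUE)   # attach it for the current session
pkg_loaded <- paste0("package:", pkg) %in% search()
pkg_loaded                            # TRUE once the package is attached
#+END_SRC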
For complete, precise information regarding installation of
\(\mathsf{R}\) and add-on packages, see the [[http://cran.r-project.org/manuals.html][R Installation and Administration manual]].
** Communicating with \(\mathsf{R}\)
:PROPERTIES:
:CUSTOM_ID: sec-Communicating-with-R
:END:
*** One line at a time
This is the most basic method and is the first one that beginners will use.
- RGui (Microsoft \(\circledR\) Windows)
- Terminal
- Emacs/ESS, XEmacs
- JGR
*** Multiple lines at a time
For longer programs (called /scripts/) there is too much code to write
all at once at the command prompt. Furthermore, for longer scripts it
is convenient to be able to only modify a certain piece of the script
and run it again in \(\mathsf{R}\). Programs called /script editors/
are specially designed to aid the communication and code writing
process. They have all sorts of helpful features including
\(\mathsf{R}\) syntax highlighting, automatic code completion,
delimiter matching, and dynamic help on the \(\mathsf{R}\) functions
as they are being written. Even more, they often have all of the text
editing features of programs like Microsoft\(\circledR\) Word. Lastly,
most script editors are fully customizable in the sense that the user
can customize the appearance of the interface to choose what colors to
display, when to display them, and how to display them.
- \(\mathsf{R}\) Editor (Windows): :: @@latex:\index{R
Editor@\textsf{R} Editor}@@ In Microsoft\(\circledR\) Windows,
\(\mathsf{R}\) Gui has its own built-in script editor, called
\(\mathsf{R}\) Editor. From the console window, select =File=
\(\triangleright\) =New Script=. A script window opens, and the
lines of code can be written in the window. When satisfied with
the code, the user highlights all of the commands and presses
\textsf{Ctrl+R}. The commands are automatically run at once in
\(\mathsf{R}\) and the output is shown. To save the script for
later, click =File= \(\triangleright\) =Save as...= in
\(\mathsf{R}\) Editor. The script can be reopened later with
=File= \(\triangleright\) =Open Script...= in =RGui=. Note that
\(\mathsf{R}\) Editor does not have the fancy syntax highlighting
that the others do.
- \(\mathsf{R}\) WinEdt: :: @@latex:\index{RWinEdt@\textsf{R}WinEdt}@@
This option is coordinated with WinEdt for \LaTeX{} and has
additional features such as code highlighting, remote sourcing,
and a ton of other things. However, one first needs to download
and install a shareware version of another program, WinEdt, which
is only free for a while -- pop-up windows will eventually appear
that ask for a registration code. \(\mathsf{R}\) WinEdt is
nevertheless a very fine choice if you already own WinEdt or are
planning to purchase it in the near future.
- Tinn \(\mathsf{R}\) / Sciviews K: :: @@latex:\index{Tinn R@Tinn
\textsf{R}}\index{Sciviews K}@@ This one is completely free and
has all of the above mentioned options and more. It is simple
enough to use that the user can virtually begin working with it
immediately after installation. But Tinn \(\mathsf{R}\) proper is
only available for Microsoft\(\circledR\) Windows operating
systems. If you are on MacOS or Linux, a comparable alternative
is Sci-Views - Komodo Edit.
- Emacs/ESS: :: @@latex:\index{Emacs}\index{ESS}@@ Emacs is an
all-purpose text editor. It can do absolutely anything
with respect to modifying, searching, editing, and
manipulating text. And if Emacs can't do it, then you
can write a program that extends Emacs to do it. One
such extension is called =ESS=, which stands for
/E/-macs /S/-peaks /S/-tatistics. With ESS a person
can speak to \(\mathsf{R}\), do all of the tricks that
the other script editors offer, and much, much
more. Please see the following for installation
details, documentation, reference cards, and a whole
lot more: http://ess.r-project.org. /Fair warning/:
if you want to try Emacs and if you grew up with
Microsoft\(\circledR\) Windows or Macintosh, then you
are going to need to relearn everything you thought
you knew about computers your whole life. (Or, since
Emacs is completely customizable, you can reconfigure
Emacs to behave the way you want.) I have personally
experienced this transformation and I will never go
back.
- JGR (read "Jaguar"): :: @@latex:\index{JGR}@@ This one has the bells
and whistles of =RGui= plus it is based on Java, so it works on
multiple operating systems. It has its own script editor like
\(\mathsf{R}\) Editor but with additional features such as syntax
highlighting and code-completion. If you do not use
Microsoft\(\circledR\) Windows (or even if you do) you definitely
want to check out this one.
- Kate, Bluefish, /etc/ :: There are literally dozens of other text
editors available, many of them free, and each has its own
(dis)advantages. I have mentioned only the ones with which I have
had substantial personal experience and have enjoyed at some
point. Play around, and let me know what you find.
*** Graphical User Interfaces (GUIs)
By the word "GUI" I mean an interface in which the user communicates
with \(\mathsf{R}\) by way of points-and-clicks in a menu of some
sort. Again, there are many, many options and I only mention ones that
I have used and enjoyed. Some of the other more popular script editors
can be downloaded from the \(\mathsf{R}\)-Project website at
http://www.sciviews.org/_rgui/. On the left side of the screen
(under *Projects*) there are several choices available.
- \(\mathsf{R}\) Commander :: provides a point-and-click interface to
many basic statistical tasks. It is called the "Commander"
because every time one makes a selection from the menus, the code
corresponding to the task is listed in the output window. One can
take this code, copy-and-paste it to a text file, then re-run it
at a later time without the \(\mathsf{R}\) Commander's
assistance. It is well suited for the introductory level. =Rcmdr=
\cite{Rcmdr} also allows for user-contributed "Plugins" which are
separate packages on =CRAN= that add extra functionality to the
=Rcmdr= package. The plugins are typically named with the prefix
=RcmdrPlugin= to make them easy to identify in the =CRAN= package
list. One such plugin is the =RcmdrPlugin.IPSUR= package
\cite{RcmdrPlugin.IPSUR} which accompanies this text.
- Poor Man's GUI :: @@latex:\index{Poor Man's GUI}@@ is an alternative
to the =Rcmdr= which is based on GTK instead of
Tcl/Tk. It has been a while since I used it but I
remember liking it very much when I did. One thing
that stood out was that the user could
drag-and-drop data sets for plots. See here for
more information:
http://wiener.math.csi.cuny.edu/pmg/.
- Rattle :: @@latex:\index{Rattle}@@ is a data mining toolkit which
was designed to manage/analyze very large data sets, but
it provides enough other general functionality to merit
mention here. See \cite{rattle} for more information.
- Deducer :: @@latex:\index{Deducer}@@ is relatively new and shows
promise from what I have seen, but I have not actually
used it in the classroom yet.
** Basic \(\mathsf{R}\) Operations and Concepts
:PROPERTIES:
:CUSTOM_ID: sec-Basic-R-Operations
:END:
The \(\mathsf{R}\) developers have written an introductory document
entitled "An Introduction to \(\mathsf{R}\)". There is a sample
session included which shows what basic interaction with
\(\mathsf{R}\) looks like. I recommend that all new users of
\(\mathsf{R}\) read that document, but bear in mind that there are
concepts mentioned which will be unfamiliar to the beginner.
Below are some of the most basic operations that can be done with
\(\mathsf{R}\). Almost every book about \(\mathsf{R}\) begins with a
section like the one below; look around to see all sorts of things
that can be done at this most basic level.
*** Arithmetic
:PROPERTIES:
:CUSTOM_ID: sub-Arithmetic
:END:
#+BEGIN_SRC R :exports both :results output pp
2 + 3 # add
4 * 5 / 6 # multiply and divide
7^8 # 7 to the 8th power
#+END_SRC
#+RESULTS:
: [1] 5
: [1] 3.333333
: [1] 5764801
Notice the comment character =#=. Anything typed
after a =#= symbol is ignored by \(\mathsf{R}\). We know that \(20/6\)
is a repeating decimal, but the above example shows only 7 digits. We
can change the number of digits displayed with
=options= @@latex:\index{options@\texttt{options}}@@:
#+BEGIN_SRC R :exports both :results output pp
options(digits = 16)
10/3 # see more digits
sqrt(2) # square root
exp(1) # Euler's number, e
pi
options(digits = 7) # back to default
#+END_SRC
#+RESULTS:
: [1] 3.333333333333333
: [1] 1.414213562373095
: [1] 2.718281828459045
: [1] 3.141592653589793
Note that it is possible to set =digits=
@@latex:\index{digits@\texttt{digits}}@@ up to 22, but setting them
over 16 is not recommended (the extra significant digits are not
necessarily reliable). Above notice the =sqrt=
@@latex:\index{sqrt@\texttt{sqrt}}@@ function for square roots and the
=exp= @@latex:\index{exp@\texttt{exp}}@@ function for powers of
\(\mathrm{e}\), Euler's number.
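
The =digits= option changes only how numbers are printed, not how
they are stored. A small sketch of this point, using names of our own
choosing (=third=, =third_txt=) and the =digits= arguments of =print=
and =format= so that the global option is left alone:
#+BEGIN_SRC R :exports both :results output pp
third <- 10/3
print(third, digits = 3)                 # display 3 significant digits
third_txt <- format(third, digits = 10)  # character string, 10 significant digits
third_txt
third == 10/3                            # the stored value is unchanged
#+END_SRC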
*** Assignment, Object names, and Data types
:PROPERTIES:
:CUSTOM_ID: sub-Assignment-Object-names
:END:
It is often convenient to assign numbers and values to variables
(objects) to be used later. The proper way to assign values to a
variable is with the =<-= operator (with a space on either side). The
~=~ symbol works too, but it is recommended by the \(\mathsf{R}\)
masters to reserve ~=~ for specifying arguments to functions
(discussed later). In this book we will follow their advice and use
=<-= for assignment. Once a variable is assigned, its value can be
printed by simply entering the variable name by itself.
#+BEGIN_SRC R :exports both :results output pp
x <- 7*41/pi # don't see the calculated value
x # take a look
#+END_SRC
#+RESULTS:
: [1] 91.35494
When choosing a variable name you can use letters, numbers, dots
"\texttt{.}", or underscore "\texttt{\_}" characters. A name must
begin with a letter or a dot; you cannot use mathematical operators,
and a leading dot may not be followed by a number. Examples of valid
names are: =x=, =x1=, =y.value=, and =y_hat=. (More precisely, the set
of allowable characters in object names depends on one's particular
system and locale; see An Introduction to \(\mathsf{R}\) for more
discussion on this.)
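
One way to check a candidate name is to compare it with the output of
=make.names=, which rewrites any string into a syntactically valid
name. The sketch below rests on that assumption; the helper
=is_valid_name= and the list of candidates are hypothetical and not
part of \(\mathsf{R}\).
#+BEGIN_SRC R :exports both :results output pp
# hypothetical helper: a string is a valid name if make.names leaves it alone
is_valid_name <- function(s) s == make.names(s)
candidates <- c("x", "x1", "y.value", "y_hat", ".2way", "2x", "a+b")
name_check <- is_valid_name(candidates)
names(name_check) <- candidates
name_check    # the last three fail: dot then digit, leading digit, operator
#+END_SRC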
Objects can be of many /types/, /modes/, and /classes/. At this level,
it is not necessary to investigate all of the intricacies of the
respective types, but there are some with which you need to become
familiar:
- integer: :: the values \(0\), \(\pm1\), \(\pm2\), ...; these are
represented exactly by \(\mathsf{R}\).
- double: :: real numbers (rational and irrational); these numbers are
not represented exactly (save integers or fractions with
a denominator that is a power of 2, see
\cite{Venables2010}).
- character: :: elements that are wrapped with pairs of ="= or ='=;
- logical: :: includes =TRUE=, =FALSE=, and =NA= (which are reserved
words); the =NA= @@latex:\index{NA@\texttt{NA}}@@ stands
for "not available", /i.e./, a missing value.
You can determine an object's type with the =typeof=
@@latex:\index{typeof@\texttt{typeof}}@@ function. In addition to the above,
there is the =complex= @@latex:\index{complex@\texttt{complex}}
\index{as.complex@\texttt{as.complex}}@@ data type:
#+BEGIN_SRC R :exports both :results output pp
sqrt(-1) # isn't defined
sqrt(-1+0i) # is defined
sqrt(as.complex(-1)) # same thing
(0 + 1i)^2 # should be -1
typeof((0 + 1i)^2)
#+END_SRC
#+RESULTS:
: [1] NaN
: [1] 0+1i
: [1] 0+1i
: [1] -1+0i
: [1] "complex"
Note that you can just type =(1i)^2= to get the same answer. The
=NaN= @@latex:\index{NaN@\texttt{NaN}}@@ stands for "not a number"; it is
represented internally as =double= @@latex:\index{double}@@.
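
Returning to the basic types listed above, the following sketch shows
what =typeof= reports for each, and illustrates the remark that
doubles are exact only in special cases such as fractions whose
denominator is a power of 2. The names =sum_exact= and =sum_inexact=
are ours.
#+BEGIN_SRC R :exports both :results output pp
typeof(7L)         # "integer": the L suffix requests an integer
typeof(7)          # "double": plain numbers are doubles by default
typeof("seven")    # "character"
typeof(NA)         # "logical"
typeof(NaN)        # "double"
sum_exact <- (1/2 + 1/4 == 0.75)     # denominators are powers of 2
sum_inexact <- (0.1 + 0.2 == 0.3)    # 0.1, 0.2, 0.3 are not stored exactly
c(sum_exact, sum_inexact)            # TRUE FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE: comparison within a tolerance
#+END_SRC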
*** Vectors
:PROPERTIES:
:CUSTOM_ID: sub-Vectors
:END:
All of this time we have been manipulating vectors of length 1. Now
let us move to vectors with multiple entries.
**** Entering data vectors
*The long way:* @@latex:\index{c@\texttt{c}}@@ If you would like to enter the
data =74,31,95,61,76,34,23,54,96= into \(\mathsf{R}\), you may create
a data vector with the =c= function (which is short for
/concatenate/).
#+BEGIN_SRC R :exports both :results output pp
x <- c(74, 31, 95, 61, 76, 34, 23, 54, 96)
x
#+END_SRC
#+RESULTS:
: [1] 74 31 95 61 76 34 23 54 96
The elements of a vector are usually coerced by \(\mathsf{R}\) to the
most general type of any of the elements, so if you do =c(1, "2")=
then the result will be =c("1", "2")=.
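
A short sketch of this coercion rule, with illustrative object names
(=mixed_num=, =mixed_chr=) that do not disturb the vector =x= above:
#+BEGIN_SRC R :exports both :results output pp
mixed_num <- c(TRUE, 2L, 3.5)   # logical and integer are promoted to double
mixed_num
typeof(mixed_num)               # "double"
mixed_chr <- c(1, "2", TRUE)    # one character element makes everything character
mixed_chr
typeof(mixed_chr)               # "character"
#+END_SRC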
*A shorter way:* @@latex:\index{scan@\texttt{scan}}@@ The =scan= method is
useful when the data are stored somewhere else. For instance, you may
type =x <- scan()= at the command prompt and \(\mathsf{R}\) will
display =1:= to indicate that it is waiting for the first data
value. Type a value and press =Enter=, at which point \(\mathsf{R}\)
will display =2:=, and so forth. Note that entering an empty line
stops the scan. This method is especially handy when you have a column
of values, say, stored in a text file or spreadsheet. You may copy and
paste them all at the =1:= prompt, and \(\mathsf{R}\) will store all
of the values instantly in the vector =x=.
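
For a reproducible, non-interactive illustration, =scan= also accepts
its input through the =text= argument. The sketch below assumes the
values were copied as a single space-separated string; the names
=pasted= and =y= are ours, chosen so that the vector =x= above is left
untouched.
#+BEGIN_SRC R :exports both :results output pp
pasted <- "74 31 95 61 76 34 23 54 96"   # stands in for values copied from a file
y <- scan(text = pasted, quiet = TRUE)   # numeric (double) values by default
y
length(y)
#+END_SRC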
*Repeated data; regular patterns:* the =seq= @@latex:\index{seq@\texttt{seq}}@@
function will generate all sorts of sequences of numbers. It has the
arguments =from=, =to=, =by=, and =length.out= which can be set in
concert with one another. We will do a couple of examples to show you
how it works.
#+BEGIN_SRC R :exports both :results output pp
seq(from = 1, to = 5)
seq(from = 2, by = -0.1, length.out = 4)
#+END_SRC
#+RESULTS:
: [1] 1 2 3 4 5
: [1] 2.0 1.9 1.8 1.7
Note that we can get the first line much quicker with the colon
operator.
#+BEGIN_SRC R :exports both :results output pp
1:5
#+END_SRC
#+RESULTS:
: [1] 1 2 3 4 5
The vector =LETTERS= @@latex:\index{LETTERS@\texttt{LETTERS}}@@ has the 26
letters of the English alphabet in uppercase and
=letters= @@latex:\index{letters@\texttt{letters}}@@ has all of them in
lowercase.
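
Two more patterns are worth a quick sketch: =seq= with =length.out= in
place of =by=, and the base function =rep= for repeated data. The
object names here are illustrative only.
#+BEGIN_SRC R :exports both :results output pp
pts <- seq(from = 0, to = 1, length.out = 5)   # 5 equally spaced points
pts
alternating <- rep(c(1, 2), times = 3)         # repeat the whole vector 3 times
alternating
doubled <- rep(c("a", "b"), each = 2)          # repeat each element twice
doubled
#+END_SRC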
**** Indexing data vectors
Sometimes we do not want the whole vector, but just a piece of it. We
can access the intermediate parts with the =[]= operator. Observe
(with =x= defined above)
#+BEGIN_SRC R :exports both :results output pp
x[1]
x[2:4]
x[c(1,3,4,8)]
x[-c(1,3,4,8)]
#+END_SRC
#+RESULTS:
: [1] 74
: [1] 31 95 61
: [1] 74 95 61 54
: [1] 31 76 34 23 96
Notice that we used the minus sign to specify those elements that we
do /not/ want.
#+BEGIN_SRC R :exports both :results output pp
LETTERS[1:5]
letters[-(6:24)]
#+END_SRC
#+RESULTS:
: [1] "A" "B" "C" "D" "E"
: [1] "a" "b" "c" "d" "e" "y" "z"
*** Functions and Expressions
:PROPERTIES:
:CUSTOM_ID: sub-Functions-and-Expressions
:END:
A function takes arguments as input and returns an object as
output. There are functions to do all sorts of things. We show some
examples below.
#+BEGIN_SRC R :exports both :results output pp
x <- 1:5
sum(x)
length(x)
min(x)
mean(x) # sample mean
sd(x) # sample standard deviation
#+END_SRC
#+RESULTS:
: [1] 15
: [1] 5
: [1] 1
: [1] 3
: [1] 1.581139
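Arguments can be matched by position or by name, and naming them with
~=~ is how optional settings are supplied. A brief sketch with
standard functions; the vector =scores= and the name =m= are made up
for illustration:
#+BEGIN_SRC R :exports both :results output pp
scores <- c(2, 4, NA, 9)
mean(scores)                      # NA, because one value is missing
m <- mean(scores, na.rm = TRUE)   # drop the missing value first
m                                 # (2 + 4 + 9) / 3
round(pi, digits = 3)             # named argument controls the rounding
#+END_SRC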
It will not be long before the user starts to wonder how a particular
function is doing its job, and since \(\mathsf{R}\) is open-source,
anybody is free to look under the hood of a function to see how things
are calculated. For detailed instructions see the article "Accessing
the Sources" by Uwe Ligges \cite{Ligges2006}. In short:
*Type the name of the function* without any parentheses or
arguments. If you are lucky then the code for the entire function will
be printed, right there looking at you. For instance, suppose that we
would like to see how the =intersect=
@@latex:\index{intersect@\texttt{intersect}}@@ function works:
#+BEGIN_SRC R :exports both :results output pp
intersect
#+END_SRC
#+RESULTS:
: function (x, ...)
: UseMethod("intersect")
: <environment: namespace:prob>
*If instead* it shows =UseMethod(something)=
@@latex:\index{UseMethod@\texttt{UseMethod}}@@ then you will need to choose the
/class/ of the object to be inputted and next look at the /method/
that will be /dispatched/ to the object. For instance, typing =rev=
@@latex:\index{rev@\texttt{rev}}@@ says
#+BEGIN_SRC R :exports both :results output pp
rev
#+END_SRC
#+RESULTS:
: function (x)
: UseMethod("rev")
: <bytecode: 0x8bd24a8>
: <environment: namespace:base>
The output is telling us that there are multiple methods associated
with the =rev= function. To see what these are, type
#+BEGIN_SRC R :exports both :results output pp
methods(rev)
#+END_SRC
#+RESULTS:
: [1] rev.default rev.dendrogram* rev.likert* rev.zoo rev.zooreg*
:
: Non-visible functions are asterisked
Now we learn that there are several different =rev(x)= methods, only
one of which is chosen at each call depending on what =x= is (the
exact list depends on which packages are loaded). Among them is one
for =dendrogram= objects and a =default= method for everything not
covered by a more specific method. Simply type the name to see what
each visible method does. For example, the =default= method can be
viewed with
#+BEGIN_SRC R :exports both :results output pp
rev.default
#+END_SRC
#+RESULTS:
: function (x)
: if (length(x)) x[length(x):1L] else x
: <bytecode: 0xa32509c>
: <environment: namespace:base>
*Some functions are hidden* by a /namespace/ (see An Introduction to
\(\mathsf{R}\) \cite{Venables2010}), and are not visible on the first
try. For example, if we try to look at the code for =wilcox.test=
@@latex:\index{wilcox.test@\texttt{wilcox.test}}@@ (see Chapter [[#cha-Nonparametric-Statistics]]) we get the following:
#+BEGIN_SRC R :exports both :results output pp
wilcox.test
methods(wilcox.test)
#+END_SRC
#+RESULTS:
: function (x, ...)
: UseMethod("wilcox.test")
: <bytecode: 0x8957fd0>
: <environment: namespace:stats>
: [1] wilcox.test.default* wilcox.test.formula*
:
: Non-visible functions are asterisked
If we were to try =wilcox.test.default= we would get a "not found"
error, because it is hidden behind the namespace for the package
=stats= \cite{stats} (shown in the last line when we tried
=wilcox.test=). In cases like these we prefix the package name to the
front of the function name with three colons; the command
=stats:::wilcox.test.default= will show the source code, omitted here
for brevity.
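As an alternative to the three colons, the =utils= package (attached
by default) has two functions that will retrieve a hidden method by
name. A brief sketch, in which the object name =f= is our own choice:
#+BEGIN_SRC R :exports code :eval never
# Retrieve the hidden default method of wilcox.test
f <- getS3method("wilcox.test", "default")
is.function(f)
environmentName(environment(f))   # the namespace in which the method lives

# getAnywhere also reports where the object was found
getAnywhere("wilcox.test.default")
#+END_SRC
Typing =f= at the prompt would then display the source code, just as
=stats:::wilcox.test.default= does.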
*If it shows* =.Internal(something)=
@@latex:\index{.Internal@\texttt{.Internal}}@@ or =.Primitive(something)=
@@latex:\index{.Primitive@\texttt{.Primitive}}@@, then it will be necessary to
download the source code of \(\mathsf{R}\) (which is /not/ a binary
version with an =.exe= extension) and search inside the code
there. See Ligges \cite{Ligges2006} for more discussion on this. An
example is =exp=:
#+BEGIN_SRC R :exports both :results output pp
exp
#+END_SRC
#+RESULTS:
: function (x) .Primitive("exp")
Be warned that most of the =.Internal= functions are written in other
computer languages which the beginner may not understand, at least
initially.
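It is possible to check whether a function is a primitive without
printing it. The short sketch below compares =exp= with the =rev=
function from earlier in this section.
#+BEGIN_SRC R :exports code :eval never
is.primitive(exp)   # TRUE
is.primitive(rev)   # FALSE, rev is written in R itself
typeof(exp)         # "builtin"
typeof(rev)         # "closure"
#+END_SRC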
** Getting Help
:PROPERTIES:
:CUSTOM_ID: sec-Getting-Help
:END:
When you are using \(\mathsf{R}\), it will not take long before you
find yourself needing help. Fortunately, \(\mathsf{R}\) has extensive
help resources and you should immediately become familiar with
them. Begin by clicking =Help= on =RGui=. The following options are
available.
- Console: :: gives useful shortcuts, for instance, =Ctrl+L=, to clear
the \(\mathsf{R}\) console screen.
- FAQ on \(\mathsf{R}\): :: frequently asked questions concerning
general \(\mathsf{R}\) operation.
- FAQ on \(\mathsf{R}\) for Windows: :: frequently asked questions
about \(\mathsf{R}\), tailored to the Microsoft Windows operating
system.
- Manuals: :: technical manuals about all features of the
\(\mathsf{R}\) system including installation, the
complete language definition, and add-on packages.
- \(\mathsf{R}\) functions (text)...: :: use this if you know the
/exact/ name of the function you want to know more about, for
example, =mean= or =plot=. Typing =mean= in the window is
equivalent to typing =help("mean")=
@@latex:\index{help@\texttt{help}}@@ at the command line, or more
simply, =?mean= @@latex:\index{?@\texttt{?}}@@. Note that this
method only works if the function of interest is contained in a
package that is already loaded into the search path with
=library=.
- HTML Help: :: use this to browse the manuals with point-and-click
links. It also has a Search Engine \& Keywords for
searching the help page titles, with point-and-click
links for the search results. This is possibly the
best help method for beginners. It can be started from
the command line with the command
=help.start()= @@latex:\index{help.start@\texttt{help.start}}@@.
- Search help ...: :: use this if you do not know the exact name of
the function of interest, or if the function is in a package that
has not been loaded yet. For example, you may enter =plo= and a
text window will return listing all the help files with an alias,
concept, or title matching =plo= using regular expression
matching; it is equivalent to typing
=help.search("plo")= @@latex:\index{help.search@\texttt{help.search}}@@ at
the command line. The advantage is that you do not need to know
the exact name of the function; the disadvantage is that you
cannot point-and-click the results. Therefore, one may wish to
use the HTML Help search engine instead. An equivalent way is
=??plo= @@latex:\index{??@\texttt{??}}@@ at the command line.
- search.r-project.org ...: :: this will search for words in help
lists and email archives of the \(\mathsf{R}\) Project. It can be
very useful for finding other questions that other users have
asked.
- Apropos ...: :: use this for more sophisticated partial name
matching of functions. See =?apropos=
@@latex:\index{apropos@\texttt{apropos}}@@ for details.
On the help pages for a function there are sometimes "Examples"
listed at the bottom of the page, which will work if copy-pasted at
the command line (unless marked otherwise). The =example=
@@latex:\index{example@\texttt{example}}@@ function will run the code
automatically, skipping the intermediate step. For instance, we may
try =example(mean)= to see a few examples of how the =mean= function
works.
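The command line equivalents of the menu items above are collected in
the sketch below. The object name =hits= is our own choice. The
commands that open help pages are wrapped in =interactive()= merely so
that the block does nothing surprising when run as a script.
#+BEGIN_SRC R :exports code :eval never
# Partial name matching: visible objects whose names start with "plo"
hits <- apropos("^plo")
hits

if (interactive()) {
  help("mean")         # same as ?mean; requires the exact name
  help.search("plo")   # same as ??plo; searches aliases, concepts, and titles
  help.start()         # opens the HTML help in a browser
  example(mean)        # runs the Examples from the mean help page
}
#+END_SRC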
*** \(\mathsf{R}\) Help Mailing Lists
There are several mailing lists associated with \(\mathsf{R}\), and
there is a huge community of people that read and answer questions
related to \(\mathsf{R}\). See [[http://www.r-project.org/mail.html][here]] for an idea of what is
available. Particularly pay attention to the bottom of the page which
lists several special interest groups (SIGs) related to
\(\mathsf{R}\).
Bear in mind that \(\mathsf{R}\) is free software written by
volunteers, and the people who frequent the mailing lists are also
volunteers who are not paid by customer support
fees. Consequently, if you want to use the mailing lists for free
advice then you must adhere to some basic etiquette, or else you may
not get a reply, or even worse, you may receive a reply which is a bit
less cordial than you are used to. Below are a few considerations:
1. Read the [[http://cran.r-project.org/faqs.html][FAQ]]. Note that there are different FAQs for different
operating systems. You should read these now, even without a
question at the moment, to learn a lot about the idiosyncrasies of
\(\mathsf{R}\).
2. Search the archives. Even if your question is not a FAQ, there is a
very high likelihood that your question has been asked before on
the mailing list. If you want to know about topic =foo=, then you
can do =RSiteSearch("foo")=
@@latex:\index{RSiteSearch@\texttt{RSiteSearch}}@@ to search the
mailing list archives (and the online help) for it.
3. Do a Google search and an =RSeek.org= search.
If your question is not a FAQ, has not been asked on
\(\mathsf{R}\)-help before, and does not yield to a Google (or
alternative) search, then, and only then, should you even consider
writing to \(\mathsf{R}\)-help. Below are a few additional
considerations.
- Read the [[http://www.r-project.org/posting-guide.html][posting guide]] before posting. This will save you a lot of
trouble and pain.
- Get rid of the command prompts (=>=) from output. Readers of your
message will take the text from your mail and copy-paste into an
\(\mathsf{R}\) session. If you make the readers' job easier then it
will increase the likelihood of a response.
- Questions are often related to a specific data set, and the best way
to communicate the data is with a =dump= @@latex:\index{dump@\texttt{dump}}@@
command. For instance, if your question involves data stored in a
vector =x=, you can type =dump("x","")= at the command prompt and
copy-paste the output into the body of your email message. Then the
reader may easily copy-paste the message from your email into
\(\mathsf{R}\) and =x= will be available to him/her.
- Sometimes the answer to the question is related to the operating system
used, the attached packages, or the exact version of \(\mathsf{R}\)
being used. The =sessionInfo()=
@@latex:\index{sessionInfo@\texttt{sessionInfo}}@@ command collects
all of this information to be copy-pasted into an email (and the
Posting Guide requests this information). See Appendix
[[#cha-R-Session-Information]] for an example.
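To illustrate the last two points, here is a small sketch with a
made-up vector =x=; the numbers are for illustration only.
#+BEGIN_SRC R :exports code :eval never
x <- c(2.5, 3.1, 4.7, 5.0)   # made-up data for illustration

# Print R code that recreates x; the "" sends the output to the console
dump("x", "")

# Collect the R version, operating system, and attached packages
info <- sessionInfo()
info
#+END_SRC
The output of both commands is plain text, which may be pasted
directly into the body of a message.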
** External Resources
:PROPERTIES:
:CUSTOM_ID: sec-External-Resources
:END:
There is a mountain of information on the Internet about
\(\mathsf{R}\). Below are a few of the important resources.
- The \(\mathsf{R}\)-Project for Statistical Computing @@latex:\index{The
R-Project@The \textsf{R}-Project}@@: Go [[http://www.r-project.org/][there]] first.
- The Comprehensive \(\mathsf{R}\) Archive Network
@@latex:\index{CRAN}@@: [[http://cran.r-project.org/][That is
where]] \(\mathsf{R}\) is stored along with thousands of contributed
packages. There are also loads of contributed information (books,
tutorials, /etc/.). There are mirrors all over the world with
duplicate information.
- \(\mathsf{R}\)-Forge @@latex:\index{R-Forge@\textsf{R}-Forge}@@: [[http://r-forge.r-project.org/][This is
another location]] where \(\mathsf{R}\) packages are stored. Here you
can find development code which has not yet been released to =CRAN=.
- \(\mathsf{R}\)-Wiki @@latex:\index{R-Wiki@\textsf{R}-Wiki}@@: There are many
tips, tricks, and general advice [[http://wiki.r-project.org/rwiki/doku.php][listed here]]. If you find a trick of
your own, log in and share it with the world.
- Other: the [[http://addictedtor.free.fr/graphiques/][\(\mathsf{R}\) Graph Gallery]] @@latex:\index{R Graph
Gallery@\textsf{R} Graph Gallery}@@ and [[http://bm2.genes.nig.ac.jp/RGM2/index.php][\(\mathsf{R}\) Graphical
Manual]] @@latex:\index{R Graphical Manual@\textsf{R} Graphical Manual}@@ have
literally thousands of graphs to peruse. [[http://www.rseek.org][\(\mathsf{R}\) Seek]] is a
search engine based on Google specifically tailored for
\(\mathsf{R}\) queries.
** Other Tips
It is unnecessary to retype commands repeatedly, since \(\mathsf{R}\)
remembers what you have recently entered on the command line. On the
Microsoft\(\circledR\) Windows \(\mathsf{R}\) Gui, to cycle through
the previous commands just push the \(\uparrow\) (up arrow) key. On
Emacs/ESS the command is =M-p= (which means hold down the =Alt= button
and press "p"). More generally, the command =history()=
@@latex:\index{history@\texttt{history}}@@ will show a whole list of recently
entered commands.
- To find out which variables are in the current work environment,
use the commands =objects()= @@latex:\index{objects@\texttt{objects}}@@ or
=ls()= @@latex:\index{ls@\texttt{ls}}@@. These list all available objects in
the workspace. If you wish to remove one or more variables, use
=remove(var1, var2, var3)= @@latex:\index{remove@\texttt{remove}}@@, or more
simply use =rm(var1, var2, var3)=, and to remove all objects use
=rm(list = ls())=.
- Another use of =scan= is when you have a long list of numbers
(separated by spaces or on different lines) already typed somewhere
else, say in a text file. To enter all the data in one fell swoop,
first highlight and copy the list of numbers to the Clipboard with
=Edit= \(\triangleright\) =Copy= (or by right-clicking and selecting
=Copy=). Next type the =x <- scan()= command in the \(\mathsf{R}\)
console, and paste the numbers at the =1:= prompt with =Edit=
\(\triangleright\) =Paste=. All of the numbers will automatically be
entered into the vector =x=.
- The command =Ctrl+l= clears the display in the
Microsoft\(\circledR\) Windows \(\mathsf{R}\) Gui. In Emacs/ESS,
press =Ctrl+l= repeatedly to cycle point (the place where the cursor
is) to the bottom, middle, and top of the display.
- Once you use \(\mathsf{R}\) for a while there may be some commands
that you wish to run automatically whenever \(\mathsf{R}\)
starts. These commands may be saved in a file called =Rprofile.site=
@@latex:\index{Rprofile.site@\texttt{Rprofile.site}}@@ which is
usually in the =etc= folder, which lives in the \(\mathsf{R}\) home
directory (which on Microsoft\(\circledR\) Windows usually is
=C:\Program Files\R=). Alternatively, you can make a file
=.Rprofile= @@latex:\index{.Rprofile@\texttt{.Rprofile}}@@ to be
stored in the user's home directory, or anywhere \(\mathsf{R}\) is
invoked. This allows for multiple configurations for different
projects or users. See "Customizing the Environment" of /An
Introduction to R/ for more details.
- When exiting \(\mathsf{R}\) the user is given the option to "save
the workspace". I recommend that beginners DO NOT save the
workspace when quitting. If =Yes= is selected, then all of the
objects and data currently in \(\mathsf{R}\)'s memory are saved in a
file located in the working directory called
=.RData= @@latex:\index{.RData@\texttt{.RData}}@@. This file is then
automatically loaded the next time \(\mathsf{R}\) starts (in which
case \(\mathsf{R}\) will say =[previously saved workspace
restored]=). This is a valuable feature for experienced users of
\(\mathsf{R}\), but I find that it causes more trouble than it saves
with beginners.
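Several of the tips above can be tried in one short session. The
following is only a sketch: the numbers and the object names =x= and
=y= are made up, and the =text= argument of =scan= stands in for
pasting at the =1:= prompt.
#+BEGIN_SRC R :exports code :eval never
# Read a list of numbers separated by spaces (or on different lines)
x <- scan(text = "3 1 4 1 5 9")
y <- 2 * x

ls()            # lists x and y, plus anything else already defined
rm(y)           # remove a single object
"y" %in% ls()   # FALSE, y is gone

# Where the site-wide startup file would live on this machine
file.path(R.home("etc"), "Rprofile.site")
#+END_SRC
The last command only builds the path; the file itself need not
exist until you decide to create it.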
#+LaTeX: \newpage{}
** Exercises
#+LaTeX: \setcounter{thm}{0}
* Data Description :datadesc:
:PROPERTIES:
:tangle: R/03-datadesc.R
:CUSTOM_ID: cha-Describing-Data-Distributions
:END:
#+BEGIN_SRC R :exports none :eval never
# IPSUR: Introduction to Probability and Statistics Using R
# Copyright (C) 2014 G. Jay Kerns
#
# Chapter: Data Description
#
# This file is part of IPSUR.
#
# IPSUR is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# IPSUR is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with IPSUR. If not, see <http://www.gnu.org/licenses/>.
#+END_SRC
#+BEGIN_SRC R :exports none :eval no-export
# This chapter's package dependencies
library(aplpack)
library(qcc)
library(e1071)
library(lattice)
library(ggplot2)
#+END_SRC
#+LaTeX: \noindent
In this chapter we introduce the different types of data that a
statistician is likely to encounter, and in each subsection we give
some examples of how to display the data of that particular type. Once
we see how to display data distributions, we next introduce the basic
properties of data distributions. We qualitatively explore several
data sets. Once we have an intuitive feel for the properties of data sets, we
next discuss how we may numerically measure and describe those