Preparation

Data preparation is the very first thing that you do, and spend a lot of time on, as a data analyst, well before trying to build predictive models using that data. In essence, data preparation is all about processing data to get it ready for all kinds of analysis. Industry data collection is mostly driven by the business processes at the front end, not by the needs of predictive models. At one point or another, these processes introduce errors here and there in the data. There can be many kinds of reasons [not necessarily errors] for which we'd need to pre-process our data and change it for the better:

Missing data
Potentially incorrect data
Need for changing the form of the data

We'll discuss the various reasons, and the methods to achieve our pre-processing goals, going forward.

If observations with missing values are a significant chunk of the data, then you should not drop all observations with missing values.

If a variable which had missing values has entered your model, you need to plan what to do when you encounter missing values in unseen data once the model has been put in production.

Imputing [filling in] missing values with the mean/median/mode of the respective variables.

Many times, we know what a missing value might mean in the context of the business process. For example, if the account balance is missing for a bank account, it might mean that the account balance is zero.
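
The two imputation ideas above can be sketched as follows. This is only a minimal illustration, not code from the original material: the data set accounts and its variables acct_id and balance are made up for the example.

```sas
/* Hypothetical data: account balances with some missing values */
data accounts;
    input acct_id balance;
    cards;
1 500
2 .
3 1500
4 .
;
run;

/* Option 1: business rule, a missing balance is treated as zero */
data accounts_zero;
    set accounts;
    if missing(balance) then balance_filled = 0;
    else balance_filled = balance;
run;

/* Option 2: fill missing balances with the mean of the non-missing ones.
   PROC SQL remerges the summary statistic with every row. */
proc sql;
    create table accounts_mean as
    select acct_id,
           balance,
           coalesce(balance, mean(balance)) as balance_filled
    from accounts;
quit;
```

Whichever rule is chosen, the same rule (and, for mean imputation, the same stored mean) would have to be applied to unseen data once the model is in production.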

Treatment of Outliers:

Because of outliers, the ranges of the predictor variables get inflated artificially. The model that you get might not be applicable across that range.

Some outliers have high leverage in the context of the modelling process. In the presence of such observations you'll get a model which is not a good fit for the general population [data].

If you are preparing data for predictive modelling, you need to remove outliers. However, if the variable with outliers is present in the model, you need to figure out what to do when you encounter outlier values in unseen data once the model has been put in production.

Flooring/Capping

In some cases it might make sense to replace outlying values with the lower or upper limit when they fall outside those limits. Imputing with the lower limit is called flooring, and imputing with the upper limit is called capping.

Many times, we know what an outlier value might mean in the context of the business process.
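
Flooring and capping, as described above, can be sketched in a data step. The data set spend, its variables, and the limits 0 and 1000 are assumptions made purely for illustration; in practice the limits might come from percentiles or from business knowledge.

```sas
/* Hypothetical data: monthly spend with two extreme values */
data spend;
    input cust_id amount;
    cards;
1 120
2 95
3 -4000
4 180
5 25000
;
run;

/* Assumed limits, for illustration only */
%let lower = 0;
%let upper = 1000;

data spend_capped;
    set spend;
    amount_capped = amount;
    /* Leave missing values alone; they need their own treatment */
    if not missing(amount) then do;
        if amount < &lower then amount_capped = &lower;        /* flooring */
        else if amount > &upper then amount_capped = &upper;   /* capping  */
    end;
run;
```

Keeping the original column next to the capped one makes it easy to check how many values were changed.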

Transposing Data

This is one of the very useful procedures we'll learn here. Given below is an example of long data:

famid  year  faminc
1      96    40000
1      97    40500
1      98    41000
2      96    45000
2      97    45400
2      98    45800
3      96    75000
3      98    77000

Sometimes it makes sense to convert this kind of data into a wide format. Given below is the same data in a wide format:

famid  year_96  year_97  year_98
1      40000    40500    41000
2      45000    45400    45800
3      75000    .        77000

Since SAS processes data row by row in many procedures, as well as in data step code, these kinds of transformations are needed quite often. We'll learn how to achieve this with Proc Transpose.

Formatting Data Columns, Creating Reports

In addition to other tools, we'll also learn very useful procedures for creating all kinds of reports and user-defined data formats, using Proc Report and Proc Format.

    proc sort data=dp.bank_transactions;
        by year month dc descending amount;
    run;

This works out all right but, as we have seen before, taking the output of proc means to an output dataset is not a straightforward task. Let's learn about "first." and "last.". These are temporary variables created behind the scenes when a by statement is used in data step code. [Keep in mind that the "by" statement can be used only after sorting your data.]

Let's create the data that we'll be using to learn about them:

    data example;
        input grps section $ score;
        cards;
    1 a 10
    1 a 20
    1 b 30
    1 b 40
    2 a 50
    2 a 60
    2 b 0
    2 b -10
    ;
    run;

The dataset that we have created is already sorted, hence we can simply use the "by" statement without sorting it first. When we use a "by" statement, "first." and "last." create temporary variables which take the values 1 and 0 for each observation, depending on the groups created by the variables used in the "by" statement. Let's look at the example given below to understand this better:

    data example;
        set example;
        by grps;
        first_grps=first.grps;
        last_grps=last.grps;
    run;

    data example1;
        set example;
        by grps section;
        first_section=first.section;
        last_section=last.section;
    run;

In the first program we used "by grps". The variable "grps" creates two groups in the data, one for the value 1 and another for the value 2. The variable "first." takes the value 1 for the first observation in each group and 0 for the others; similarly, the "last." variable takes the value 1 for the last observation in each group and 0 for the others. In the second program we used "by grps section", which makes more groups in the data; first. and last. take the values 1 and 0 accordingly. We don't really need to create these first. and last. variables in order to use them; in the programs above we created them just for demonstration.

Let's use them to solve a problem similar to the one we solved for the bank_transactions data. Let's get the top score for each section.

    proc sort data=example;
        by grps section descending score;
    run;

    data top_example;
        set example;
        by grps section;
        if first.section;
    run;

In a similar fashion, we can solve the original problem that we solved for the dataset bank_transactions:

    proc sort data=dp.bank_transactions;
        by year month dc descending amount;
    run;
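
The sort above would presumably be followed by a data step that keeps the first row of each group, mirroring the top_example program. The sketch below is an assumption about that missing step, not the author's code: the output name top_transactions is made up, and it relies on the dp library and the bank_transactions columns used in the sort.

```sas
/* Sort so that the largest amount comes first within each year-month-dc group */
proc sort data=dp.bank_transactions;
    by year month dc descending amount;
run;

/* top_transactions is a hypothetical output name */
data top_transactions;
    set dp.bank_transactions;
    by year month dc;
    /* After the descending sort, the first row of each
       year-month-dc group carries the largest amount */
    if first.dc;
run;
```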

Numeric Functions

Before we start to learn about SAS functions, let's learn a way to avoid creating a dataset every time we just want to see what a function does. A handy way is to name the outgoing dataset simply "_null_"; this tells SAS not to create any dataset in the data step program. But we do need something which will show us the result of the function that we just used. The "put" statement comes to the rescue. The put statement prints whatever we ask it to in the log. Remember: not in the result window, but in the log window. Let's look at a few numeric functions available in the SAS system:

    data _null_;
        x=sqrt(2000000);
        y=log(x);
        z=sum(23,34,56);
        put x;
        put y;
        put z;
    run;

There are several such numeric functions. A longer list can be found here:
http://support.sas.com/documentation/cdl/en/imlug/59656/HTML/default/viewer.htm#langref_sect321.htm

A quick list that comes to mind is this: log, exp, sqrt, mean, median, sum, n, nmiss. These functions do what their names suggest. That is not an exhaustive list either. In fact, you can find almost all the direct mathematical formulas that you use in the SAS function list if you look in the documentation. We'll not be going through all the functions. One important thing, however, is to understand that data processing in SAS happens row by row, not column by column. Let's create a data set and understand how these functions work row by row, not column by column.

    data func;
        input x y z;
        cards;
    10 20 30
    1 2 3
    5.4 6.7 9.33
    100 200 0
    ;
    run;

Now let's apply some numeric functions and see what they do.

    data func;
        set func;
        s1=sum(x);
        s2=sum(x,y,z);
    run;

You will notice that the variable "s1" above does not contain the sum of the entire column x. In fact, it contains exactly the same values as x. Why? Because these functions work only on rows, not on columns. In any given row there is only one value of x to be summed, so the result is just x. On the other hand, "s2" is the sum of the values of the variables x, y and z in the same row.
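
For contrast, a column-wise total needs a procedure rather than a data step function. The sketch below is an illustration added here, not part of the original material; it rebuilds the same four rows under the assumed name func_demo so that it can be run on its own, and x_sum and x_total are made-up names.

```sas
/* Same values as the func data set in the text */
data func_demo;
    input x y z;
    cards;
10 20 30
1 2 3
5.4 6.7 9.33
100 200 0
;
run;

/* Column-wise total of x, shown in the results window (116.4 for this data) */
proc means data=func_demo sum;
    var x;
run;

/* The same total written to a one-row data set */
proc sql;
    create table x_sum as
    select sum(x) as x_total
    from func_demo;
quit;
```

Note that sum behaves differently in the two places: in a data step it adds its arguments within one row, while in proc sql with a single column it aggregates down the column.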

Note: you must be wondering why we need a function for sum when we can use the algebraic sign "+" for the same purpose. Well, there is a small difference. When the function sum encounters a missing value while performing addition, it ignores it, whereas if that happens while using the "+" operator, you'll get a missing value as the result. Let's see an example:

    data _null_;
        x=sum(10,20,30,.);
        y=10+20+30+.;
        put x;
        put y;
    run;

String Functions

We saw that most of the numeric functions are simply named after their mathematical counterparts. These names readily make sense and tell us what the functions are used for. The same is not the case for string functions, that is, functions which are used to process character variables. We'll talk about a few important character functions in detail, with examples.

scan

This function takes a string as its first input. Imagine a scenario where this input string is an address whose elements, such as house number, street, city etc., are separated by "/". The third input to the scan function is this delimiter, which separates the different elements within the string. The second input is the position of the element which you want to extract from the string. For example, we have this address:

"1502/Panch Mahal/Malad/Mumbai"

And we want to extract the suburb name from this address, which is the third element if we consider "/" to be the delimiter in the string. Let's see:

    data _null_;
        address="1502/Panch Mahal/Malad/Mumbai";
        suburb=scan(address,3,"/");
        put suburb;
    run;

Explore Yourself: Can we use multiple delimiters with scan?

substr

The function substr can be used to extract a substring from a larger string if we know the starting position and the length of the substring within the larger input string. Keep in mind that counting starts at one, not at zero as in many other programming languages. Here are a few examples:

    data _null_;
        IP="192.168.1.1:543";
        /* 3 characters starting at position 5 */
        port=substr(IP,5,3);
        put port;
    run;

    data _null_;
        IP="192.168.1.1:543,AutomatedMails";
        /* from position 13 to the end of the string */
        port=substr(IP,13);
        /* 3 characters starting at position 13 */
        port1=substr(IP,13,3);
        put port;
        put port1;
    run;

Explore Yourself: What happens if we give an input for the end position in the function substr?

trim, strip, ||, catx, compress

The functions named above and the operator || are used to remove white space from the input string in various ways [trim, strip, compress] and to combine strings [||, catx]. We'll learn through some examples:

    data _null_;
        x="Lalit";
        y="Sachan";
        z=x||y;
        m=x||"-7@"||y;
        put z;
        put m;
    run;

You can see that the operator || [the double pipe symbol] simply combines strings. Let's look at the white-space-removing functions and the peculiarities associated with them.

    data _null_;
        x=trim("  Lalit  ");
        y=trim("  Sachan  ");
        z="@"||x||"@"||y||"@";
        x_l=length(x);
        y_l=length(y);
        put x_l;
        put y_l;
        put z;
    run;

You can see that in the above example none of the spaces get removed. This is a peculiarity of using trim in an assignment: trim removes only trailing spaces, and when the result is stored in a variable, SAS pads it with spaces again up to the variable's length. Trim has a visible effect only when it is used directly inside the expression, where it removes the trailing spaces from the string:

    data _null_;
        x="  Lalit  ";
        y="  Sachan  ";
        z="@"||trim(x)||"@"||trim(y)||"@";
        put z;
    run;

Now let's look at how strip behaves. We are using the length function to check whether the trim/strip functions are working, in addition to printing the values in the log using the "put" statement.

    data _null_;
        x=strip("  Lalit  ");
        y=strip("  Sachan  ");
        z="@"||x||"@"||y||"@";
        put z;
    run;

As opposed to the trim function, in the above example strip removes the leading spaces. Let's see how it behaves when used directly during new variable creation.

    data _null_;
        x="  Lalit  ";
        y="  Sachan  ";
        z="@"||strip(x)||"@"||strip(y)||"@";
        put z;
    run;

In this case it removes all the leading and trailing spaces [but not the ones in between].

compress

This function removes all spaces from the string, including the ones which are in between.

    data _null_;
        x="  Lalit  Sachan  ";
        z="@"||compress(x)||"@";
        put z;
    run;

catx

This function concatenates strings after removing leading and trailing spaces from them. The first argument here, however, is the delimiter which will be used while combining the strings. If any of the strings to be combined consists only of white space, it is ignored. Here is an example to help you understand better. Notice how the white-space-only string is simply ignored while creating y. In both cases "$" has been used as the delimiter.

    data _null_;
        x=catx("$"," 45 "," ytfy ","asdf ");
        y=catx("$"," xd ", " ","dr ");
        put x;
        put y;
    run;

Explore Yourself: Find out what the functions "upcase" and "lowcase" do. Come up with a functioning example.

find

This function is used to find the starting position of a smaller substring in a larger input string. Remember that counting starts at one, from the beginning of the string. The first argument to the function is the larger string in which we aim to find the smaller one. The second argument is the string which we are looking for in the larger one. The third argument is the position in the larger string from which the search should start. If that number is positive, the search is done from left to right; if it is negative, the search is done from right to left. The returned value, however, is always the starting position of the smaller string counted from the beginning of the larger string. If the third argument is left out, then by default the search starts at the beginning of the string and is done left to right.

Also note that if there are multiple occurrences of the smaller string, the starting position returned is that of the occurrence which is encountered first, depending on the starting position and direction of the search as specified by the inputs of the function. Given below are a few examples:

    data _null_;
        x="akjs@askj@asdkf@a";
        z=find(x,"@a");
        m=find(x,"@a",7);
        k=find(x,"@a",-17);
        a=find(x,"@a",-7);
        b=find(x,"@a",17);
        put z;
        put m;
        put k;
        put a;
        put b;
    run;

The search is case sensitive by default, as can be seen in the example below. The "S" at position 1 is not matched by "s" because it is in caps in the larger string; the position of the lowercase "s" further along is returned instead.

    data _null_;
        x="SjdksdA";
        y=FiNd(x,"s");
        put y;
    run;

If you want your search to be case insensitive, you need to use the modifier "i". The first and second arguments are meant for the string to be searched in and the string to be searched for. Beyond that, "i" is the modifier which makes your search case insensitive.

    data _null_;
        x="akjs@askIj@asdkf@a";
        z=find(x,"@A");
        m=find(x,"@A","i",7);
        n=find(x,"i",7,"i");
        put m;
        put z;
        put n;
    run;

Explore Yourself: What does the modifier "t" do in the function "find"?

Tranwrd

This function is used to replace occurrences of a substring in the larger input string. In the example given below we are replacing all hyphens with "/". The first argument is the string in which we want to make the replacements, the second argument is what we want to replace, and the third is what we want to replace it with.

    data _null_;
        address="1203-Some Tower-powai/Mumbai";
        proper_add=tranwrd(address,"-","/");
        put proper_add;
    run;

Here is an exercise. Run the code given below to create the dataset:

    data Add;
        length address $40;
        input address $;
        cards;
    1604-some-chandiwali,Mumbai
    12-a/Delhi
    First-Street,Chennai
    ;
    run;

Once that is done, create a column in the dataset which contains the city names extracted from these addresses. Do that using whatever functions you think are appropriate for the task.

Exercise Solution:
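
One possible approach to the exercise is sketched below. It is offered as an illustration only and is not the author's solution: it assumes the city is always the last element of the address once both "," and "/" are treated as delimiters, and the names Add_demo, Add_city and city are made up.

```sas
/* Same three addresses as in the exercise data step */
data Add_demo;
    length address $40;
    input address $;
    cards;
1604-some-chandiwali,Mumbai
12-a/Delhi
First-Street,Chennai
;
run;

data Add_city;
    set Add_demo;
    length city $20;
    /* A negative count makes scan count elements from the right;
       both "," and "/" are treated as delimiters here */
    city = scan(address, -1, ",/");
run;
```

Other combinations, for example tranwrd to unify the delimiters followed by scan, would work just as well.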

In the data set temp above, x is essentially a string, as can be confirmed by looking at its type. We can now apply a date informat to it, through the input function, to create another variable which contains the same values but as dates.

    data temp;
        set temp;
        format y mmddyy10.;
        y=input(x,ddmmyy10.);
        put y;
    run;

If you look at the type of the variable "some" in the data set temp, it is character. Let's convert it to a numeric variable.

    data temp;
        set temp;
        some_num=input(some,8.);
    run;

smallest, largest

The functions min and max always give the smallest and largest values. However, at times we might need the nth largest or smallest value among many. For that we can use the smallest or largest functions. The first argument to these functions is the value of "n". The example given below gets the 3rd smallest and 3rd largest values from the data, respectively.

    data _null_;
        x=smallest(3,23,1,4,-5,7,0,10);
        y=largest(3,23,1,4,-5,7,0,10);
        put x;
        put y;
    run;

Lag

Since SAS processes data row by row by default, there is no direct method to access previous observations in a data step. To do so we have to use the lag function, which is designed specifically for this:

    data temp;
        input A $ B C;
        cards;
    truck 10 1
    truck 20 2
    truck 30 3
    car 40 4
    car 50 5
    car 60 6
    ;
    run;

    data temp;
        set temp;
        D=lag(B);
    run;

You can see that the new variable "D" simply takes the previous values of the variable B. In other words, it is equivalent to column "B" with one lag. You can apply the lag function with multiple lags too, by using the function lagn. Following is an example with lag3.

    data temp;
        set temp;
        D=lag3(B);
    run;

However, this gets tricky if you use the function lag inside a condition. In that case the lag function returns only those values which it got to see within the condition block, that is, the value from the last time that particular lag call was executed rather than the value from the previous row. Here is an example. Try to understand it and, if it doesn't make sense, ask for a detailed explanation in class:

    proc sort data=temp;
        by A;
    run;

    data temp;
        set temp;
        by A;
        new_var=first.A;
        if first.A then D=lag(B);
        else D=lag(C);
    run;
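
A common way to sidestep the conditional-lag surprise is to call lag on every row and apply the condition afterwards. The sketch below illustrates that pattern under assumed names (vehicles, type, qty, prev_qty); it is not taken from the original material.

```sas
/* Hypothetical data, already grouped by vehicle type */
data vehicles;
    input type $ qty;
    cards;
truck 10
truck 20
truck 30
car 40
car 50
car 60
;
run;

data vehicles_lagged;
    set vehicles;
    by type notsorted;       /* groups are contiguous but not in sort order */
    prev_qty = lag(qty);     /* executed on every row, so the queue stays in step */
    if first.type then prev_qty = .;   /* no previous row inside a new group */
run;
```

Because lag(qty) runs unconditionally, prev_qty always holds the value from the row immediately before, and the first row of each group is then blanked out explicitly.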

Round

The round function is used to round off numeric values. The first argument is the value being rounded off and the second argument is the rounding unit.

    data _null_;
        x=123.45567;
        y=round(x);
        z=round(x,0.001);
        put z;
        put y;
    run;

In the above example, the second input is .001, which means x will be rounded off to the 3rd digit after the decimal point. You can think of the process like this: first x is divided by .001, the result is rounded off to the nearest integer, and that is then multiplied by .001. So x/.001 = 123455.67; rounded off to the nearest integer this becomes 123456; multiplied by .001 again it becomes 123.456. Let's take a few more examples:

    data _null_;
        x=123.45567;
        y=round(x,0.1);
        z=round(x,100);
        m=round(x,10);
        put m;
        put z;
        put y;
    run;

Consider m=round(x,10): first x gets divided by 10, which gives 12.345567; then it gets rounded off to the nearest integer, which is 12; then it gets multiplied by 10 and becomes 120, which is the final value of m.
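
The divide, round, multiply reasoning above can be checked step by step in a data step. This is only a sketch of that mental model, with made-up variable names (unit, step1 to step3); it is not a statement about how round is implemented internally.

```sas
data _null_;
    x = 123.45567;
    unit = 10;
    step1 = x / unit;        /* 12.345567 */
    step2 = round(step1);    /* nearest integer: 12 */
    step3 = step2 * unit;    /* 120 */
    m = round(x, unit);      /* built-in equivalent */
    put step1= step2= step3= m=;
run;
```

Changing unit lets you repeat the same check for other rounding units.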

Explore Yourself: Do the above process for y and z as well, and see whether the final values match your calculations.

Proc Rank

Proc rank is used to make bins in your data. You choose a numeric variable by which you want to make the bins. For example, in the data set sashelp.cars we want to make bins by the variable invoice. What happens is that the data is sorted by the variable invoice and then, starting from the top, roughly equal numbers of observations are put into each bin.

    proc rank data=sashelp.cars out=car_rank groups=10;
        var invoice;
        ranks basket;
    run;

"groups=10" tells proc rank that there are going to be 10 bins/groups in the data. "ranks basket" names the variable containing the group/bin number as "basket". Bin numbering starts with 0.

Proc transpose

This is used to convert your data from long to wide or from wide to long, as discussed before. Let's create the same data which we showed there:

    data long1;
        input famid year faminc;
        cards;
    1 96 40000
    1 97 40500
    1 98 41000
    2 96 45000
    2 97 45400
    2 98 45800
    3 96 75000
    3 98 77000
    ;
    run;

The following program uses proc transpose to convert the long format data into wide:

    proc transpose data=long1 out=wide1 prefix=year_;
        by famid;
        id year;
        var faminc;
    run;

by statement: makes rows based on how many unique values the variable specified in the by statement has.
id statement: makes columns based on how many unique values the variable specified in the id statement has.
var statement: fills the values of the variable specified in the var statement into the resulting cells of the transposed dataset.

If some cells don't have a corresponding value in the incoming dataset, they are assigned missing values, such as the cell corresponding to year 97 and famid 3 in the above example.

The next question that might be bothering you is what happens if there is more than one variable to be filled in. You simply get multiple rows for each value of the variable in the "by" statement. For instance, in the example given below you get 2 rows for each famid.

    data long2;
        input famid year faminc spend;
        cards;
    1 96 40000 38000
    1 97 40500 39000
    1 98 41000 40000
    2 96 45000 42000
    2 97 45400 43000
    2 98 45800 44000
    3 96 75000 70000
    3 97 76000 71000
    3 98 77000 72000
    ;
    run;
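
The transpose step for long2 is not shown, so the following is a sketch of what it would presumably look like, modelled on the long1 program. It assumes the long2 data set created just above; the output name wide2 is made up.

```sas
/* Assumes long2 from the data step above; wide2 is a hypothetical output name */
proc transpose data=long2 out=wide2 prefix=year_;
    by famid;
    id year;
    var faminc spend;   /* two variables, so two output rows per famid */
run;
```

In the output, the automatic _NAME_ column tells you which of the two variables (faminc or spend) each row came from.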

Proc Format

Proc format is used to create user-defined formats. It does not require any input from a dataset, and a created format can be applied to any variable in any dataset. An example is given below. Also, a format does not change the underlying value of the variable; it only changes how it is displayed.

    proc format;
        value $jc 'one'='Management'
                  'two'='Trainees';
        value Grade 0-32="F"
                    33-45="C"
                    46-58="B"
                    60-100="A";
    run;

The "value" statement here is the one which essentially creates the format for you. If the format is going to be applied to character values, then the format name starts with a "$" sign; otherwise the name starts as usual. Naming constraints for formats are largely the same as for variable names, except that a format name cannot end in a number.

In the value statements given above we created the format $jc. If we apply it to a categorical variable and the value is "one", then the displayed value will be "Management", and "Trainees" if the value is "two". If the value does not match either "one" or "two", then the value will be displayed as is.

For the numeric format Grade, if the value of the numeric variable to which it is applied is in the range 0-32, then "F" will be displayed. If a value does not match any of the given ranges, then a * will be displayed in its place.

Let's see an example of these formats being applied to the data set temp. To emphasize that the underlying values don't change, I have also created a numeric variable in the same data step.

    data temp;
        input jobs $ marks;
        cards;
    one 10
    two 75
    one 34
    two 59
    abc 79
    one 49
    one 56
    two 90
    abc 20
    ;
    run;

    data temp;
        set temp;
        format jobs $jc.;
        format marks grade.;
        marks2=marks/2;
    run;
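
Since a format only changes the display, a related question is how to store the formatted label as an actual value. One way is the put function, sketched below. The data set marks_demo and the variable grade_label are assumptions for the example, and the format definition is repeated so that the block runs on its own.

```sas
proc format;
    value grade 0-32="F" 33-45="C" 46-58="B" 60-100="A";
run;

data marks_demo;
    input marks;
    /* put applies the format and returns the label as a character value */
    grade_label = put(marks, grade.);
    cards;
10
75
34
;
run;
```

Here marks stays numeric and unformatted, while grade_label is a separate character variable holding the label.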

Proc SQL

This is the implementation of the SQL language within SAS. All of the tasks which we'll see here can be achieved with whatever we have learned so far. SQL queries are, however, at times easier to read and write. But do not use them with large datasets; they might not be as fast as their data step counterparts. You will see that SQL queries are very English-like to write. They are mostly used to subset, summarize and pre-process the data. There are no predictive modeling procedures in the SQL framework.

We'll see that all SQL queries are just select statements. These select statements have incremental capabilities, which we'll see starting with the simplest form, where you select all the observations from the incoming dataset. All SQL queries are going to be in a block starting with "proc sql" and closed with "quit". The result of the selection will be displayed in the result window. If we want to put the result of the selection in a data set, we can simply add "create table table_name as" in front of the select statement. Let's see some examples:

    proc sql;
        select *
        from sashelp.cars;
    quit;
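
The create table variant of the same query is not shown here, so the following is a minimal sketch of it. The table name lalit is the one mentioned in the text; everything else follows the select statement above.

```sas
proc sql;
    /* Writes the selection to work.lalit instead of only querying it */
    create table lalit as
    select *
    from sashelp.cars;
quit;
```

Prefixing the name with a libref (for example dp.lalit, assuming a library called dp has been assigned) would create the table in that library instead of work.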

All observations are still selected, and a table named "lalit" is created in the work library [you can supply a libref for it to be created in some other location] with all the observations. From here onwards we'll not use create table; whenever you want to do that, simply add that part in front of the select statement.

If you do not want to select all the columns of the data, you restrict the selection by mentioning the variable names separated by commas.

    proc sql;
        select name,nhits
        from sashelp.baseball;
    quit;

This controls the number of variables/columns which you are selecting from the dataset. Now, what if I want to restrict the number of observations? There are many ways to do it.

    proc sql inobs=10;
        select name
        from sashelp.baseball;
        select make
        from sashelp.cars;
    quit;

There is also an option called outobs. Outobs specifies the number of observations which go out. In the current example it works the same as inobs, but when you are processing data it behaves differently.

    proc sql outobs=10;
        select name
        from sashelp.baseball;
    quit;
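
The difference between inobs and outobs shows up as soon as the query summarizes rows. The sketch below is an added illustration using sashelp.cars; the column alias n_cars is made up.

```sas
/* inobs limits the rows READ: only the first 10 cars are counted,
   so the group counts add up to at most 10 */
proc sql inobs=10;
    select origin, count(*) as n_cars
    from sashelp.cars
    group by origin;
quit;

/* outobs limits the rows WRITTEN: all cars are read and counted,
   but only the first 2 result rows are returned */
proc sql outobs=2;
    select origin, count(*) as n_cars
    from sashelp.cars
    group by origin;
quit;
```

SAS writes a warning to the log when either limit truncates the data.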

In a where clause we can write multiple conditions as well, by combining them with the and/or operators.

    proc sql;
        create table temp as
        select invoice,origin,drivetrain,type,mpg_city
        from sashelp.cars
        where origin="USA" and type="Sedan" and mpg_city>15;
    quit;

Remember that you don't necessarily need to select the variable on which you apply the condition. The next requirement is to sort the data; for that we add order by to our select statement.

    proc sql;
        select invoice,origin
        from sashelp.cars
        order by invoice;
    quit;

The default order of sorting is ascending. If you want to sort in descending order, you'll have to use the keyword desc, as given below:

    proc sql;
        select invoice,origin
        from sashelp.cars
        order by invoice desc;
    quit;
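
Summaries by group use a group by clause. Since no group by example with a mean appears at this point, here is a minimal sketch using sashelp.cars; the alias price_mean is an illustrative name.

```sas
proc sql;
    /* One output row per origin, with the average MSRP of that group */
    select origin, mean(msrp) as price_mean
    from sashelp.cars
    group by origin;
quit;
```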

Here the summary operation [such as calculating the mean in the above example] is carried out on the groups created by "group by". Here are a few more examples, one of which includes order by as well.

    proc sql;
        select origin, std(msrp) as price_std
        from sashelp.cars
        group by origin;
    quit;

    proc sql;
        select make, std(msrp) as price_var
        from sashelp.cars
        group by make
        order by price_var;
    quit;

Now suppose we want to put a condition on the newly created variable [price_var]. Let's see if a simple where condition works:

    proc sql;
        select make, std(msrp) as price_var
        from sashelp.cars
        where price_var>10000
        group by make
        order by price_var;
    quit;

It does not, because where is applied to the incoming rows before the groups are summarized. To apply conditions on variables which are created in SQL queries, we need to use "having":

    proc sql;
        select make, std(msrp) as price_var
        from sashelp.cars
        group by make
        having(price_var>10000)
        order by price_var;
    quit;

The sequence in which you should write the clauses: where > group by > having > order by.

Next we'll see how to get data from multiple tables.

    libname dp "/folders/myfolders/Datasets/Data Prep";

The key is to give names [aliases] to the tables, which can be used to reference a table while extracting columns from it. We'll try to solve the following case, which involves getting data from multiple tables.

Case: the datasets gaming1, gaming2 and gaming3 contain information on customers of a gaming company which provides an online platform for playing team games such as AOE, DOTA and CS. We want to get the ids of those customers who play DOTA on Mac OS in solo sessions with a free license type, and whose average time per session is more than 40 minutes.

We'll give names to the tables in the select statement itself. I have written the following select statement on multiple lines for better readability.

    proc sql;
        select a.gamer_id
        from dp.gaming1 as a,
             dp.gaming2 as b,
             dp.gaming3 as c
        where b.os="mac"
          and a._game_name="dota"
          and a.atps>40
          and c.session_type="solo"
          and b.license="free"
          and a.gamer_id=b.gamer_id
          and a.gamer_id=c.gamer_id
        ;
    quit;

    data s2;
        input id b $;
        cards;
    1 p
    2 l
    3 m
    ;
    run;

    proc sql;
        select a,b
        from s1,s2;
    quit;

Now if we put that correspondence in the where condition, we'll get the desired result.

    proc sql;
        select a,b
        from s1,s2
        where s1.id=s2.id;
    quit;

Explore Yourself:
* How to join/merge tables using SQL
* What do distinct and count do when used with SQL queries

We'll conclude here. In case of any doubts regarding the content of this study material, please post on the QA forum in the LMS.

Prepared By: Lalit Sachan
Contact: lalit.sachan@edvancer.in