def find_mean(y):
    """
    Quick function to find the mean of a list
    Assumes we have a list with individual numeric type data elements
    """
    return sum(y)/len(y)
User Defined Functions
Justin Post
Quick recap - This course is split into a few topics
- Programming in python
- Dealing with data in python
- Basics of Predictive Modeling
- Big Data Management
- Modeling Big Data (with Spark via pyspark)
- Streaming Data
We’re working through learning how to program in python. We’ve seen
- how to program through Google Colab
- how to bring in modules
- common data types: strings, lists, numeric types (and booleans)
We’ve also seen a bit about how to think about data. Now, we’ll focus on improving our programs a bit before we get back to handling data!
In order to get the most out of any programming language, we need to understand how to write our own functions. User-Defined functions allow us to streamline our code, simplify large sections of code, and make our code easier to generalize to other situations.
Note: These types of webpages are built from Jupyter notebooks (.ipynb files). You can access your own versions of them by clicking here. It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you’d like!
Function Creation Syntax
To create our own functions, we just need to
- use the keyword def and give the function name with arguments
- tab in (four spaces) our function body (code that the function runs)
- at the top of the function body we usually add a multi-line string (via triple quotes) explaining the function purpose and arguments (called a doc string)
- we use return to return an object
def function_name(arg1, arg2, arg3 = default_arg3):
    """
    Documentation string
    """
    Function body
    return object
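To make the template concrete, here is a minimal sketch of a small function that follows it. The function name rectangle_area and its arguments are made up for illustration only.

```python
def rectangle_area(width, height = 1):
    """
    Compute the area of a rectangle
    width and height are assumed to be numeric; height defaults to 1
    """
    area = width * height
    return area

print(rectangle_area(3))              # height falls back to its default of 1
print(rectangle_area(3, height = 4))  # default overridden by the caller
```

The first call relies on the default value for height and the second replaces it.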
Write Our Own Mean Function
We discussed common tasks for data. Of course one was simply describing a data set that we have. One way to describe the center of a numeric variable’s distribution is through the sample mean.
- Given data points labeled as \(y_1, y_2, ..., y_n\) (\(n\) is the number of observations), the sample mean is
\[\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i\]
Let’s write a function to calculate the mean of a list of numbers using the sum() and len() functions.
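As a minimal sketch, such a function only needs one line in its body. The comparison against fmean() from the standard library's statistics module is an addition here, used only as a sanity check on a small made-up list.

```python
from statistics import fmean

def find_mean(y):
    """
    Quick function to find the mean of a list
    Assumes we have a list with individual numeric type data elements
    """
    return sum(y)/len(y)

values = [2, 4, 6, 8]
print(find_mean(values))  # our function
print(fmean(values))      # standard library version, for comparison
```

Both calls should agree on this list.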
Now let’s apply our function to a list of numeric values. We can create a sequence of values using the range() function. Here we give it two arguments, the starting point and the ending point (which isn’t included).
range() itself is an immutable iterable type object. It isn’t the values themselves but an object that can be used to create the values. In the case of range() it can be described as a lazy list. We’ll discuss iterators more shortly.
One way to get the range() object to create its values is by running list() on it. This tells python to iterate over the range() object and produce the numbers.
seq = range(0,11) #same as range(11)
seq #doesn't show values
range(0, 11)
list(seq)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
find_mean(list(seq))
5.0
Iterators (and iterator type objects) are often used to save memory: you often don’t need the entire sequence at once, but you do want to use its values in some kind of order.
By iterating across the elements and not saving the entire object, we can save memory. We only need to know where we are on the iteration and how the iteration should be done!
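A rough way to see the memory savings is to compare the size of a range() object with the size of the list it produces. This is only a sketch: sys.getsizeof() reports the size of the container itself, and the exact byte counts depend on your python version.

```python
import sys

lazy_seq = range(1_000_000)   # only stores the start, stop, and step
full_list = list(lazy_seq)    # stores a reference to every one of the values

print(sys.getsizeof(lazy_seq))   # small, no matter how long the range is
print(sys.getsizeof(full_list))  # grows with the number of elements
```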
Add a Default Argument
Often we want to give default arguments to our function. That is, arguments that are used unless the user specifies something else.
- Suppose we want to add in a trimmed mean functionality
- This is a mean where we first remove the smallest p% of values and the largest p% of values. We then take the mean of the remaining numbers.
- A trimmed mean is more robust to outliers. For instance,
find_mean([1,2,3,4,5,100]) #the mean is greatly affected by the large value
19.166666666666668
find_mean([1,2,3,4,5]) #remove the large value to get a better idea about 'most' of the data values
3.0
To create a trimmed mean function (or option at least), we need to do the following:
- Sort the observations
- Remove the lowest p% and highest p%
- Find mean on the remaining values
#can pull in the floor and sqrt functions from math to help us out
from math import floor, sqrt
#generate 50 random values from the standard normal distribution (covered shortly)
import numpy as np
y = np.random.default_rng(1).standard_normal(50)
#convert to a list just so we are working with an object we've studied
y = list(y)
y[0:10]
[0.345584192064786,
0.8216181435011584,
0.33043707618338714,
-1.303157231604361,
0.9053558666731177,
0.4463745723640113,
-0.5369532353602852,
0.5811181041963531,
0.36457239618607573,
0.294132496655526]
Note that lists have a .sort() method but this modifies the list in place. Instead we can use the sorted() function which returns a new sorted version of the list.
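The difference between the two is easiest to see on a tiny list. The list below is made up just for this sketch.

```python
values = [3, 1, 2]

new_list = sorted(values)   # returns a new list; values is untouched
print(values)
print(new_list)

result = values.sort()      # sorts values in place and returns None
print(values)
print(result)
```

Because .sort() returns None, writing something like values = values.sort() would throw the data away, which is one reason sorted() is convenient inside a function.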
sort_y = sorted(y)
print(sort_y[0:10])
[-2.7111624789659685, -1.8890132459676727, -1.6480751708556527, -1.303157231604361, -1.2273520542445742, -1.1120207626922813, -0.9447516230607774, -0.7819084623568421, -0.7364540870016669, -0.6832266617805622]
Now, given a value of p, we can remove the lowest and highest p% of values. We can do this with the floor() function. This gives us the largest integer less than or equal to a given value.
print(floor(4))
print(floor(4.2))
print(floor(4.9))
4
4
4
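For the positive values we use here, floor() behaves the same as chopping off the decimal. As a small aside (not needed for the trimmed mean), the sketch below shows where the two differ, using negative inputs chosen just for illustration.

```python
from math import floor

print(floor(-4.2))             # largest integer that is <= -4.2
print(int(-4.2))               # int() truncates toward zero instead
print(floor(4.2) == int(4.2))  # the two agree for positive values
```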
Given a p (for proportion) we can determine the number of observations corresponding to that proportion using the length of y.
p = 0.11
print(p*len(sort_y))
to_remove = floor(p*len(sort_y))
to_remove
5.5
5
We can remove observations by simply subsetting our list using the : operator we studied (slicing). Remember that this operator doesn’t include the last value. (i.e. 2:5 gives the elements at indices 2, 3, and 4)
print([to_remove, len(sort_y)-to_remove])#values we want to keep are between these
[5, 45]
- Remember, counting starts at 0
- We want to remove the first 5 values so we should start with index 5 (the 6th actual value!)
- With a length 50 list, we want to remove the 46-50th elements which correspond to the 45-49 indices
- Since we don’t include our last index, we can end on 45
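The same index logic can be checked by hand on a short list. This is a sketch with a made-up, already sorted list of length 10 and a proportion of 0.2, so two values come off each end.

```python
from math import floor

small = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # already sorted, length 10
p = 0.2
to_remove = floor(p * len(small))                  # 2 values from each end

kept = small[to_remove:(len(small) - to_remove)]   # keeps indices 2 through 7
print(kept)
print(sum(kept) / len(kept))                       # trimmed mean of the kept values
```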
#elements we want for a 11% trimmed mean
sort_y[to_remove:(len(sort_y)-to_remove)]
[-1.1120207626922813,
-0.9447516230607774,
-0.7819084623568421,
-0.7364540870016669,
-0.6832266617805622,
-0.5369532353602852,
-0.5140063716874629,
-0.5062916583143148,
-0.48211931267997826,
-0.42219041157635356,
-0.37760500712699807,
-0.2924567509650886,
-0.2756029052993704,
-0.2571922406188707,
-0.17477209205516195,
-0.16290994799305278,
-0.09826996785221727,
-0.07204367972722743,
0.008142180518343508,
0.02842224131579679,
0.03558623705548571,
0.03972210748165899,
0.09548302746945433,
0.10901408782154753,
0.16746474422274113,
0.2136429974986111,
0.21732193102256359,
0.294132496655526,
0.33043707618338714,
0.345584192064786,
0.36457239618607573,
0.4463745723640113,
0.5467129866124469,
0.5811181041963531,
0.5937480717858228,
0.5988462126346276,
0.6467029962018469,
0.6630633723762617,
0.8216181435011584,
0.8911669542823284]
Modify the function arguments
Now that we have the process down, we can add our arguments/calculations. (By the way, this is a good way to write functions: work out the steps outside of a function first and then put the pieces into the function.)
We’ll add
- a method = argument with a default value of None. None is a special name that defines no value in python
  + If this argument takes on Trim, we’ll do a trimmed mean.
  + This can be done using if Boolean: with the resulting code to execute tabbed in four spaces (covered shortly!)
- a p = argument to specify the proportion to remove with a default value set to 0.
def find_mean(y, method = None, p = 0):
    """
    Quick function to find the mean
    Assumes we have a list with only numeric type data
    If method is set to Trim, will remove outer most p values off the data
    before finding the mean
    """
    if method == "Trim": #we'll cover if shortly! The indented code only runs if this condition is met
        sort_y = sorted(y)
        to_remove = floor(p*len(sort_y))
        y = sort_y[to_remove:(len(sort_y)-to_remove)] #replace y with the modified version
    return sum(y)/len(y)
Let’s test the function!
find_mean(y, method = "Trim", p = 0) #usual mean
-0.03607807742830818
find_mean(y, method = "Trim", p = 0.05) #5% trimmed mean
-0.029659532804894563
find_mean(y, method = "trim", p = 0.05) #usual mean not trimmed if method is not set correctly
-0.03607807742830817
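The last call shows that a misspelled method silently gives the usual mean. One possible way to guard against that, sketched here under the assumption that we would rather stop with an error, is to check the argument first. The name find_mean_strict is hypothetical and raising errors is a topic we haven't covered yet.

```python
from math import floor

def find_mean_strict(y, method = None, p = 0):
    """
    Hypothetical stricter variant of find_mean
    Raises a ValueError for an unrecognized method instead of silently
    returning the usual mean
    """
    if method not in (None, "Trim"):
        raise ValueError("method must be None or 'Trim'")
    if method == "Trim":
        sort_y = sorted(y)
        to_remove = floor(p*len(sort_y))
        y = sort_y[to_remove:(len(sort_y)-to_remove)]
    return sum(y)/len(y)

data = [1, 2, 3, 4, 5, 100]
print(find_mean_strict(data, method = "Trim", p = 0.2))  # drops 1 and 100
try:
    find_mean_strict(data, method = "trim", p = 0.2)     # lowercase t is not accepted
except ValueError as err:
    print("Error:", err)
```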
Positional vs Named Arguments
- A function can be called using positional or named args
#def find_mean(y, method = None, p = 0):
print(find_mean(y, None))
print(find_mean(method = "Trim", p = 0.1, y = y))
print(find_mean(y, "Trim", 0.1))
-0.03607807742830817
-0.009797451217442077
-0.009797451217442077
- You can’t place positional args after a keyword though!
find_mean(y = x, "Trim") #throws an error
  File "<ipython-input-20-39dc4eceb262>", line 1
    find_mean(y = x, "Trim")
                     ^
SyntaxError: positional argument follows keyword argument
Defining the Type of Argument
- A function definition may look like:
def f(pos1, pos2, /, pos_or_kwd, *, kwd1, kwd2):
      -----------    ----------     ----------
        |             |                  |
        |        Positional or keyword   |
        |                                - Keyword only
         -- Positional only
def print_it(x, y, /):
    print("Must pass x and y positionally!" + x + y)
def print_it(x, /, y):
    print("x must be passed positionally. y can be positional or named" + x + y)
def print_it(x, /, y, *, z):
    print("Now z must be passed as a named argument" + x + y + z)
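To see these rules in action, here is a small runnable sketch. The function describe is made up for illustration, and the try/except pieces are only there so that the failing calls print their error instead of stopping the code.

```python
def describe(x, /, y, *, z):
    """
    x is positional only, y can be positional or named, z is keyword only
    """
    return "x=" + str(x) + ", y=" + str(y) + ", z=" + str(z)

print(describe(1, 2, z = 3))      # y passed positionally
print(describe(1, y = 2, z = 3))  # y passed by name

try:
    describe(1, 2, 3)             # z can't be passed positionally
except TypeError as err:
    print("TypeError:", err)

try:
    describe(x = 1, y = 2, z = 3) # x can't be passed by name
except TypeError as err:
    print("TypeError:", err)
```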
Let’s modify our mean function and show this.
#with this, y must be passed positionally!
def find_mean(y, /, method = None, p = 0):
    """
    Quick function to find the mean
    Assumes we have a list with only numeric type data
    If method is set to Trim, will remove outer most p values off the data
    before finding the mean
    """
    if method == "Trim": #we'll cover if shortly! The indented code only runs if this condition is met
        sort_y = sorted(y)
        to_remove = floor(p*len(sort_y))
        y = sort_y[to_remove:(len(sort_y)-to_remove)] #replace y with the modified version
    return sum(y)/len(y)
find_mean(y, "Trim", p = 0.1)
-0.009797451217442077
find_mean(y = y, method = "Trim", p = 0.1) #this won't work!
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-24-665a7ded1b54> in <cell line: 1>()
----> 1 find_mean(y = y, method = "Trim", p = 0.1) #this won't work!

TypeError: find_mean() got some positional-only arguments passed as keyword arguments: 'y'
Write Our Own Correlation Function
Just to demonstrate something more complicated, let’s write our own function to compute the (usual) sample correlation between two variables, call them x and y.
- Pearson’s correlation:
\[r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}\]
where \((x_i, y_i)\) are numeric variables observed on the same \(n\) units, \(i=1,...,n\)
Plan
Function inputs:
- \(x\), \(y\), lists with numeric entries only
Function body:
- Find sample means for \(x\) and \(y\)
- Compute numerator sum and denominator sums
- Find quotient and return that value
Finding Means
Let’s create some example data. \(x\) and \(y\) won’t be related here so the sample correlation should be near 0!
x = list(range(1,51))
print(x[1:10])
xbar = find_mean(x)
xbar
[2, 3, 4, 5, 6, 7, 8, 9, 10]
25.5
#use same y as before
y = list(np.random.default_rng(1).standard_normal(50))
print(y[1:10])
ybar = find_mean(y)
ybar
[0.8216181435011584, 0.33043707618338714, -1.303157231604361, 0.9053558666731177, 0.4463745723640113, -0.5369532353602852, 0.5811181041963531, 0.36457239618607573, 0.294132496655526]
-0.03607807742830817
Again, these two vectors are not related and should have a near 0 correlation!
Next, we need to find the numerator and denominator sums. Finding the sums will be easier once we learn arrays, but for now we’ll peek at a for loop and the zip() function.
Let’s start with computation of \[\sum_{i=1}^n(x_i-\bar{x})^2\]
#computation in one of our sums (we want this across all 50 values, then added up)
(x[0]-xbar)**2
600.25
So really we want to find all of these values:
(x[0]-xbar)**2
(x[1]-xbar)**2
...
(x[49]-xbar)**2
We can use for to iterate over the elements of x (those at indices 0, 1, …, 49). Similar to function definitions and if statements, we just tab in (four spaces) the code to be executed at each iteration of the for loop.
#initialize a value to store the sum in
den_x = 0
#use a for loop to iterate across values (studied more later!)
for i in x:
    den_x += (i-xbar)**2
den_x
10412.5
We can very easily get a similar computation for \(y\)’s portion of the denominator.
To get the numerator, that’s a bit more work. We really need to find
(x[0]-xbar)*(y[0]-ybar)
(x[1]-xbar)*(y[1]-ybar)
...
(x[49]-xbar)*(y[49]-ybar)
We can zip() the \(x\) and \(y\) lists together. This essentially just pairs the 0th elements, the 1st elements, etc. Then we can iterate over the values together.
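Here is a quick sketch of what zip() produces, using two short made-up lists. Like range(), the zip object is lazy, so we wrap it in list() to see the pairs.

```python
letters = ["a", "b", "c"]
numbers = [1, 2, 3]

pairs = list(zip(letters, numbers))  # list() makes the lazy zip object produce its pairs
print(pairs)

for letter, number in zip(letters, numbers):  # each pair is unpacked into two names
    print(letter, number)
```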
num = 0
for i, j in zip(x, y): #i corresponds to the x elements and j the y elements
    num += (i-xbar)*(j-ybar)
num
-51.69981003655184
Ok, now we are ready to put these together and calculate our correlation!
def find_corr(x, y):
    """
    Compute Pearson's Correlation Coefficient
    x and y are assumed to be lists with numeric values
    Data is assumed to have no missing values
    """
    xbar = find_mean(x)
    ybar = find_mean(y)
    num = 0
    den_x = 0
    den_y = 0
    for i, j in zip(x, y):
        num +=(i-xbar)*(j-ybar)
        den_x +=(i-xbar)**2
        den_y +=(j-ybar)**2
    return num/sqrt(den_x*den_y)
Let’s test our function on our data!
find_corr(x, y) #near 0!
-0.0813179110596017
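Another quick sanity check is to try data where we know the answer. For points that fall exactly on a line, the correlation should be 1 (increasing) or -1 (decreasing). The sketch below repeats compact versions of the two functions so it can run on its own, and the small data lists are made up for illustration.

```python
from math import sqrt

def find_mean(y):
    return sum(y)/len(y)

def find_corr(x, y):
    xbar = find_mean(x)
    ybar = find_mean(y)
    num = 0
    den_x = 0
    den_y = 0
    for i, j in zip(x, y):
        num += (i-xbar)*(j-ybar)
        den_x += (i-xbar)**2
        den_y += (j-ybar)**2
    return num/sqrt(den_x*den_y)

x_small = [1, 2, 3, 4, 5]
y_up = [3, 5, 7, 9, 11]    # y = 2x + 1, exactly linear and increasing
y_down = [7, 4, 1, -2, -5] # y = 10 - 3x, exactly linear and decreasing

print(find_corr(x_small, y_up))    # should be 1 (up to rounding)
print(find_corr(x_small, y_down))  # should be -1 (up to rounding)
```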
Note that all functions with a doc string have a .__doc__ attribute that you can look at to understand that function (assuming the doc string is useful!).
print(find_corr.__doc__)
Compute Pearson's Correlation Coefficient
x and y are assumed to be lists with numeric values
Data is assumed to have no missing values
print(len.__doc__) #another example on a common function
Return the number of items in a container.
print(np.random.default_rng.__doc__) #another example
default_rng(seed=None)
Construct a new Generator with the default BitGenerator (PCG64).
Parameters
----------
seed : {None, int, array_like[ints], SeedSequence, BitGenerator, Generator}, optional
A seed to initialize the `BitGenerator`. If None, then fresh,
unpredictable entropy will be pulled from the OS. If an ``int`` or
``array_like[ints]`` is passed, then it will be passed to
`SeedSequence` to derive the initial `BitGenerator` state. One may also
pass in a `SeedSequence` instance.
Additionally, when passed a `BitGenerator`, it will be wrapped by
`Generator`. If passed a `Generator`, it will be returned unaltered.
Returns
-------
Generator
The initialized generator object.
Notes
-----
If ``seed`` is not a `BitGenerator` or a `Generator`, a new `BitGenerator`
is instantiated. This function does not manage a default global instance.
See :ref:`seeding_and_entropy` for more information about seeding.
Examples
--------
``default_rng`` is the recommended constructor for the random number class
``Generator``. Here are several ways we can construct a random
number generator using ``default_rng`` and the ``Generator`` class.
Here we use ``default_rng`` to generate a random float:
>>> import numpy as np
>>> rng = np.random.default_rng(12345)
>>> print(rng)
Generator(PCG64)
>>> rfloat = rng.random()
>>> rfloat
0.22733602246716966
>>> type(rfloat)
<class 'float'>
Here we use ``default_rng`` to generate 3 random integers between 0
(inclusive) and 10 (exclusive):
>>> import numpy as np
>>> rng = np.random.default_rng(12345)
>>> rints = rng.integers(low=0, high=10, size=3)
>>> rints
array([6, 2, 7])
>>> type(rints[0])
<class 'numpy.int64'>
Here we specify a seed so that we have reproducible results:
>>> import numpy as np
>>> rng = np.random.default_rng(seed=42)
>>> print(rng)
Generator(PCG64)
>>> arr1 = rng.random((3, 3))
>>> arr1
array([[0.77395605, 0.43887844, 0.85859792],
[0.69736803, 0.09417735, 0.97562235],
[0.7611397 , 0.78606431, 0.12811363]])
If we exit and restart our Python interpreter, we'll see that we
generate the same random numbers again:
>>> import numpy as np
>>> rng = np.random.default_rng(seed=42)
>>> arr2 = rng.random((3, 3))
>>> arr2
array([[0.77395605, 0.43887844, 0.85859792],
[0.69736803, 0.09417735, 0.97562235],
[0.7611397 , 0.78606431, 0.12811363]])
Attributes are another important thing we’ll learn about, especially when we get into pyspark. We now have
- functions() which go prior to the object
- .methods() that go on the end of the object
- and .attributes that also go on the end of an object just with no ()
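All three show up on something as simple as a complex number. This is just an illustrative sketch; the complex type is used only because it happens to have a handy function, method, and attribute.

```python
z = 3 + 4j

print(abs(z))          # function: the object goes inside the parentheses
print(z.conjugate())   # method: attached to the end of the object and called with ()
print(z.real)          # attribute: attached to the end of the object, no parentheses
```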
Other Things to Note
- When executing a function, a new symbol table is used for the local variables
- This keeps us from accidentally overwriting something
import numpy as np
y = np.array(range(1,11))

def square(z):
    y = z**2
    print("In the function environment, z = " + str(z) + " and y = " + str(y))
    return(y)
print(square(y))
print(y) #y is not changed
In the function environment, z = [ 1 2 3 4 5 6 7 8 9 10] and y = [ 1 4 9 16 25 36 49 64 81 100]
[ 1 4 9 16 25 36 49 64 81 100]
[ 1 2 3 4 5 6 7 8 9 10]
print(z) #z isn't defined outside the function call! error
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-36-7d7ac3dfdf36> in <cell line: 1>()
----> 1 print(z) #z isn't defined outside the function call! error

NameError: name 'z' is not defined
- You can define global variables from within a function using global
def square(z):
    global y
    y = z**2
    print("In the function environment, z = " + str(z) + " and y = " + str(y))
    return(y)
print(square(y))
print(y) #modified globally now
In the function environment, z = [ 1 2 3 4 5 6 7 8 9 10] and y = [ 1 4 9 16 25 36 49 64 81 100]
[ 1 4 9 16 25 36 49 64 81 100]
[ 1 4 9 16 25 36 49 64 81 100]
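Since global lets a function overwrite things outside of itself, it is often easier to follow code that returns a value and lets the caller decide what to overwrite. Here is a small sketch comparing the two styles. The names total, add_with_global, and add_with_return are made up for illustration.

```python
total = 0

def add_with_global(amount):
    global total             # total now refers to the global variable
    total = total + amount

def add_with_return(current, amount):
    return current + amount  # nothing outside the function is touched

add_with_global(5)
print(total)                       # changed from inside the function

total = add_with_return(total, 5)  # the caller chooses to overwrite total
print(total)
```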
- If nothing is returned from a function (with return) then it actually returns the special None
def square_it(a):
    if (type(a) == int) or (type(a) == float):
        return a**2
    else:
        return
print(square_it(10))
print(square_it(10.5))
print(square_it("10"))
100
110.25
None
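The same thing happens when a function has no return statement at all. A minimal sketch, with a made-up function that only prints:

```python
def greet(name):
    print("Hello, " + name)  # prints, but there is no return statement

result = greet("Ada")
print(result is None)        # the call still hands back None
```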
Default values are only evaluated once - at the time of the function definition
Mutable objects can cause an issue! (Lists are mutable as they can be changed, some objects, like tuples, are immutable and can’t be modified.)
#append a value to a list but give a default empty list if not given
def my_append(value, L = []):
    L.append(value)
    return L
#correctly appends "A" to the list
print(my_append("A"))
#appends "B" to the previous list as L = [] was only evaluated at the time the function was created!
print(my_append("B"))
['A']
['A', 'B']
- To avoid this behavior, instead define the default value as None and take care of things within the function body
def my_append(value, L = None):
    if L is None:
        L = []

    L.append(value)
    return L
print(my_append("A"))
print(my_append("B"))
['A']
['B']
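The "evaluated once" rule applies to every default value, not just mutable ones. In this sketch (the names start and show_start are made up), the default is taken from a variable that we change after the function has been defined.

```python
start = 5

def show_start(value = start):  # the default is fixed at 5 right here
    return value

start = 10                      # changing start later does not change the default
print(show_start())             # still uses the value from definition time
print(show_start(start))        # passing start explicitly uses the current value
```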
Video Demo
This quick video demo gives another example of creating our own function! Remember to pop the video out into the full player.
The notebook written in the video is available here.
from IPython.display import IFrame
IFrame(src="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=ae1858b3-74cf-4065-8ec7-b0f800e4f827&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720")
Recap
- Writing functions is super cool!
def func_name(args):
    """
    Doc string
    """
    body
    return object
Many ways to set up your function arguments and to call your function
Even more on function writing will be covered later!
If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!
If you are on Google Colab, head back to our course website for our next lesson!