Published

2025-03-31

User Defined Functions

Justin Post

Quick recap - This course is split into a few topics

  • Programming in python
  • Dealing with data in python
  • Basics of Predictive Modeling
  • Big Data Management
  • Modeling Big Data (with Spark via pyspark)
  • Streaming Data

We’re working through learning how to program in python. We’ve seen

  • how to program through Google Colab
  • how to bring in modules
  • common data types: strings, lists, numeric types (and booleans)

We’ve also seen a bit about how to think about data. Now, we’ll focus on improving our programs a bit before we get back to handling data!

In order to get the most out of any programming language, we need to understand how to write our own functions. User-Defined functions allow us to streamline our code, simplify large sections of code, and make our code easier to generalize to other situations.

Note: These types of webpages are built from Jupyter notebooks (.ipynb files). You can access your own versions of them by clicking here. It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you’d like!


Function Creation Syntax

To create our own functions, we just need to

  • use the keyword def and give the function name with arguments
  • tab in (four spaces) our function body (code that the function runs).
  • at the top of the function body we usually add a multi-line string (via triple quotes) explaining the function purpose and arguments (called a doc string)
  • we use return to return an object
def function_name(arg1, arg2, arg3 = default_arg3):
    """
    Documentation string
    """
    Function body
    return object
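
As a small illustration of this template, here is a sketch of a function with one required argument and one default argument. The name power and its arguments are made up for this example.

```python
def power(base, exponent = 2):
    """
    Raise base to the given exponent
    exponent defaults to 2 (squaring) unless the user supplies another value
    """
    result = base**exponent
    return result

print(power(3))    #uses the default exponent of 2
print(power(3, 3)) #overrides the default
```

The first call relies on the default value and the second replaces it.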

Write Our Own Mean Function

We discussed common tasks for data. Of course one was simply describing a data set that we have. One way to describe the center of a numeric variable’s distribution is through the sample mean.

  • Given data points labeled as \(y_1, y_2, ..., y_n\) (\(n\) is the number of observations), the sample mean is

\[\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i\]

Let’s write a function to calculate the mean of a list of numbers using the sum() and len() functions.

def find_mean(y):
    """
    Quick function to find the mean of a list
    Assumes we have a list with individual numeric type data elements
    """
    return sum(y)/len(y)

Now let’s apply our function to a list of numeric values. We can create a sequence of values using the range() function. Here we give it two arguments, the starting point and the ending point (which isn’t included).

range() itself is an immutable iterable type object. It isn’t the values themselves but an object that can be used to create the values. In the case of range() it can be described as a lazy list. We’ll discuss iterators more shortly.

One way to get the range() object to create its values is by running list() on it. This tells python to iterate over the range() object and produce the numbers.

seq = range(0,11) #same as range(11)
seq #doesn't show values
range(0, 11)
list(seq)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
find_mean(list(seq))
5.0
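
As a side note, range() also accepts an optional third argument giving the step size, and sum() and len() both work on a range() object directly, so converting to a list first is not strictly required for our function. A short sketch, repeating the find_mean() definition so the block runs on its own:

```python
def find_mean(y):
    """
    Quick function to find the mean of a sequence of numbers
    """
    return sum(y)/len(y)

evens = range(0, 11, 2) #start at 0, stop before 11, step by 2
print(list(evens))
print(find_mean(evens)) #no list() conversion needed here
print(find_mean(range(0, 11)))
```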

Iterators (and iterator type objects) are often used to save memory as you often don’t need the entire sequence, but do want to use them in some kind of order.

By iterating across the elements and not saving the entire object, we can save memory. We only need to know where we are on the iteration and how the iteration should be done!
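
One rough way to see the memory savings is sys.getsizeof() from the standard library. The exact byte counts depend on the python version and platform, so no specific output is shown here. The point is only that the range() object is far smaller than the list built from it.

```python
import sys

big_range = range(1_000_000)
big_list = list(big_range)

#getsizeof() reports the size of the container object itself in bytes
#(for the list this does not count the integer objects it points to)
print(sys.getsizeof(big_range))
print(sys.getsizeof(big_list))
```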


Add a Default Argument

Often we want to give default arguments to our function. That is, arguments that are used unless the user specifies something else.

  • Suppose we want to add in a trimmed mean functionality
  • This is a mean where we first remove the smallest p% of values and the largest p% of values. We then take the mean of the remaining numbers.
  • A trimmed mean is more robust to outliers. For instance,
find_mean([1,2,3,4,5,100]) #the mean is greatly affected by the large value
19.166666666666668
find_mean([1,2,3,4,5]) #remove the large value to get a better idea about 'most' of the data values
3.0

To create a trimmed mean function (or option at least), we need to do the following:

  • Sort the observations
  • Remove the lowest p% and highest p%
  • Find mean on the remaining values
#can pull in the floor and sqrt functions from math to help us out
from math import floor, sqrt
#generate 50 random values from the standard normal distribution (covered shortly)
import numpy as np
y = np.random.default_rng(1).standard_normal(50)
#convert to a list just so we are working with an object we've studied
y = list(y)
y[0:10]
[0.345584192064786,
 0.8216181435011584,
 0.33043707618338714,
 -1.303157231604361,
 0.9053558666731177,
 0.4463745723640113,
 -0.5369532353602852,
 0.5811181041963531,
 0.36457239618607573,
 0.294132496655526]

Note that lists have a .sort() method but this modifies the list in place. Instead we can use the sorted() function which returns a new sorted version of the list.

sort_y = sorted(y)
print(sort_y[0:10])
[-2.7111624789659685, -1.8890132459676727, -1.6480751708556527, -1.303157231604361, -1.2273520542445742, -1.1120207626922813, -0.9447516230607774, -0.7819084623568421, -0.7364540870016669, -0.6832266617805622]

Now, given a value of p, we can remove the lowest and highest p% of values. We can do this with the floor() function. This gives us the largest integer less than or equal to a given value.

print(floor(4))
print(floor(4.2))
print(floor(4.9))
4
4
4
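
For positive values, floor() acts like simply dropping the decimal part, but the two differ for negative numbers. This doesn’t matter for our trimmed mean (the count we compute is never negative) but it is worth knowing. A quick comparison with the built-in int():

```python
from math import floor

print(floor(-4.2))          #rounds down, away from 0 for negatives
print(int(-4.2))            #int() drops the decimal part (truncates toward 0)
print(floor(5.5), int(5.5)) #the two agree for positive values
```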

Given a p (for proportion) we can determine the number of observations corresponding to that proportion using the length of y.

p = 0.11
print(p*len(sort_y))
to_remove = floor(p*len(sort_y))
to_remove
5.5
5

We can remove observations by simply subsetting our list using the : operator we studied (slicing). Remember that this operator doesn’t include the last value. (i.e. 2:5 gives the elements at indices 2, 3, and 4)

print([to_remove, len(sort_y)-to_remove])#values we want to keep are between these
[5, 45]
  • Remember, counting starts at 0
  • We want to remove the first 5 values so we should start at index 5 (the 6th actual value!)
  • With a length 50 list, we want to remove the 46-50th elements which correspond to the 45-49 indices
  • Since we don’t include our last index, we can end on 45
#elements we want for an 11% trimmed mean
sort_y[to_remove:(len(sort_y)-to_remove)]
[-1.1120207626922813,
 -0.9447516230607774,
 -0.7819084623568421,
 -0.7364540870016669,
 -0.6832266617805622,
 -0.5369532353602852,
 -0.5140063716874629,
 -0.5062916583143148,
 -0.48211931267997826,
 -0.42219041157635356,
 -0.37760500712699807,
 -0.2924567509650886,
 -0.2756029052993704,
 -0.2571922406188707,
 -0.17477209205516195,
 -0.16290994799305278,
 -0.09826996785221727,
 -0.07204367972722743,
 0.008142180518343508,
 0.02842224131579679,
 0.03558623705548571,
 0.03972210748165899,
 0.09548302746945433,
 0.10901408782154753,
 0.16746474422274113,
 0.2136429974986111,
 0.21732193102256359,
 0.294132496655526,
 0.33043707618338714,
 0.345584192064786,
 0.36457239618607573,
 0.4463745723640113,
 0.5467129866124469,
 0.5811181041963531,
 0.5937480717858228,
 0.5988462126346276,
 0.6467029962018469,
 0.6630633723762617,
 0.8216181435011584,
 0.8911669542823284]
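
The same slicing logic is easier to check by eye on a short list. Here is a sketch with ten made-up values (one of them a large outlier) and p = 0.2, so two values come off each end:

```python
from math import floor

small = [100, 3, 1, 2, 5, 4, 7, 6, 9, 8] #ten made-up values, one large outlier
sort_small = sorted(small)
p = 0.2
to_remove = floor(p*len(sort_small)) #number of values to drop from each end
kept = sort_small[to_remove:(len(sort_small)-to_remove)]
print(to_remove)
print(kept)
```

The two smallest values (1 and 2) and the two largest (9 and 100) are dropped, leaving the middle six.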

Modify the function arguments

Now that we have the process down (this is a good way to write functions by the way, write them outside of a function first and then put the pieces into the function), we can add our arguments/calculations.

We’ll add

  • a method = argument with a default value of None. None is a special name that defines no value in python
      • If this argument takes on "Trim", we’ll do a trimmed mean.
      • This can be done using if Boolean: with the resulting code to execute tabbed in four spaces (covered shortly!)
  • a p = argument to specify the proportion to remove with a default value set to 0.

def find_mean(y, method = None, p = 0):
    """
    Quick function to find the mean
    Assumes we have a list with only numeric type data
    If method is set to "Trim", removes the lowest and highest proportion p
    of the values before finding the mean
    """
    if method == "Trim": #we'll cover if shortly! The indented code only runs if this condition is met
        sort_y = sorted(y)
        to_remove = floor(p*len(sort_y))
        y = sort_y[to_remove:(len(sort_y)-to_remove)] #replace y with the modified version
    return sum(y)/len(y)

Let’s test the function!

find_mean(y, method = "Trim", p = 0) #usual mean
-0.03607807742830818
find_mean(y, method = "Trim", p = 0.05) #5% trimmed mean
-0.029659532804894563
find_mean(y, method = "trim", p = 0.05) #usual mean not trimmed if method is not set correctly
-0.03607807742830817
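
One thing our function doesn’t guard against: if p is too large, the slice can come back empty (for example p = 0.5 with an even number of values) and sum(y)/len(y) then fails with a ZeroDivisionError. Below is one possible way to handle this, as a sketch only. The name find_mean_checked and the choice to raise a ValueError are assumptions for illustration, not part of the course code.

```python
from math import floor

def find_mean_checked(y, method = None, p = 0):
    """
    Mean of a list of numbers with an optional trimmed mean
    Raises a ValueError if p is not at least 0 and less than 0.5
    """
    if not (0 <= p < 0.5):
        raise ValueError("p must be at least 0 and less than 0.5")
    if method == "Trim":
        sort_y = sorted(y)
        to_remove = floor(p*len(sort_y))
        y = sort_y[to_remove:(len(sort_y)-to_remove)]
    return sum(y)/len(y)

print(find_mean_checked([1, 2, 3, 4, 5, 100], method = "Trim", p = 0.2))
try:
    find_mean_checked([1, 2, 3], method = "Trim", p = 0.5)
except ValueError as err:
    print(err) #the message from the raise statement above
```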

Positional vs Named Arguments

  • A function can be called using positional or named args
#def find_mean(y, method = None, p = 0):
print(find_mean(y, None))
print(find_mean(method = "Trim", p = 0.1, y = y))
print(find_mean(y, "Trim", 0.1))
-0.03607807742830817
-0.009797451217442077
-0.009797451217442077
  • You can’t place positional args after a keyword though!
find_mean(y = x, "Trim") #throws an error
  File "<ipython-input-20-39dc4eceb262>", line 1
    find_mean(y = x, "Trim")
                           ^
SyntaxError: positional argument follows keyword argument

Defining the Type of Argument

  • A function definition may look like:
def f(pos1, pos2, /, pos_or_kwd, *, kwd1, kwd2):
      -----------    ----------     ----------
        |             |                  |
        |        Positional or keyword   |
        |                                - Keyword only
         -- Positional only
def print_it(x, y, /):
    print("Must pass x and y positionally!" + x + y)

def print_it(x, /, y):
    print("x must be passed positionally.  y can be positional or named" + x + y)

def print_it(x, /, y, *, z):
    print("Now z must be passed as a named argument" + x + y + z)
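
To see these rules in action, here is a sketch that calls a function using both markers and catches the error from an invalid call. The function name describe is made up for this example, and positional-only arguments (the /) require python 3.8 or later.

```python
def describe(x, /, y, *, z):
    """
    x is positional only, y can be either, z is keyword only
    """
    return "x=" + str(x) + ", y=" + str(y) + ", z=" + str(z)

print(describe(1, 2, z = 3))     #y passed positionally
print(describe(1, y = 2, z = 3)) #y passed by name

try:
    describe(1, 2, 3) #z can't be passed positionally
except TypeError as err:
    print("TypeError:", err)
```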

Let’s modify our mean function and show this.

#with this, y must be passed positionally!
def find_mean(y, /, method = None, p = 0):
    """
    Quick function to find the mean
    Assumes we have a list with only numeric type data
    If method is set to "Trim", removes the lowest and highest proportion p
    of the values before finding the mean
    """
    if method == "Trim": #we'll cover if shortly! The indented code only runs if this condition is met
        sort_y = sorted(y)
        to_remove = floor(p*len(sort_y))
        y = sort_y[to_remove:(len(sort_y)-to_remove)] #replace y with the modified version
    return sum(y)/len(y)
find_mean(y, "Trim", p = 0.1)
-0.009797451217442077
find_mean(y = y, method = "Trim", p = 0.1) #this won't work!
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-24-665a7ded1b54> in <cell line: 1>()
----> 1 find_mean(y = y, method = "Trim", p = 0.1) #this won't work!

TypeError: find_mean() got some positional-only arguments passed as keyword arguments: 'y'

Write Our Own Correlation Function

Just to demonstrate something more complicated, let’s write our own function to compute the (usual) sample correlation between two variables, call them x and y.

  • Pearson’s correlation:

\[r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}\]

where \((x_i, y_i)\) are numeric variables observed on the same \(n\) units, \(i=1,...,n\)

Plan

Function inputs:

  • \(x\), \(y\), lists with numeric entries only

Function body:

  • Find sample means for \(x\) and \(y\)
  • Compute numerator sum and denominator sums
  • Find quotient and return that value

Finding Means

Let’s create some example data. \(x\) and \(y\) won’t be related here so the sample correlation should be near 0!

x = list(range(1,51))
print(x[1:10])
xbar = find_mean(x)
xbar
[2, 3, 4, 5, 6, 7, 8, 9, 10]
25.5
#use same y as before
y = list(np.random.default_rng(1).standard_normal(50))
print(y[1:10])
ybar = find_mean(y)
ybar
[0.8216181435011584, 0.33043707618338714, -1.303157231604361, 0.9053558666731177, 0.4463745723640113, -0.5369532353602852, 0.5811181041963531, 0.36457239618607573, 0.294132496655526]
-0.03607807742830817

Again, these two lists are not related and should have a near 0 correlation!

Next, we need to find the numerator and denominator sums. Finding the sums will be easier once we learn arrays, but for now we’ll peek at a for loop and the zip() function.

Let’s start with computation of \[\sum_{i=1}^n(x_i-\bar{x})^2\]

#computation in one of our sums (we want this across all 50 values, then added up)
(x[0]-xbar)**2
600.25

So really we want to find all of these values:

(x[0]-xbar)**2
(x[1]-xbar)**2
...
(x[49]-xbar)**2

We can use a for loop to iterate over the elements of x directly (rather than over the index values 0, 1, …, 49). Similar to function definitions and if statements, we just tab in (four spaces) the code to be executed at each iteration of the for loop.

#initialize a value to store the sum in
den_x = 0
#use a for loop to iterate across values (studied more later!)
for i in x:
    den_x += (i-xbar)**2
den_x
10412.5
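
For comparison, the same sum can be written in one line with sum() and a generator expression. This is just an alternative sketch; generator expressions haven’t been covered yet, and the name den_x_alt is made up here.

```python
x = list(range(1, 51))
xbar = sum(x)/len(x)

#loop version from above
den_x = 0
for i in x:
    den_x += (i-xbar)**2

#same computation with sum() and a generator expression
den_x_alt = sum((i-xbar)**2 for i in x)
print(den_x, den_x_alt)
```

Both approaches give 10412.5 for this x.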

We can very easily get a similar computation for \(y\)’s portion of the denominator.

To get the numerator, that’s a bit more work. We really need to find

(x[0]-xbar)*(y[0]-ybar)
(x[1]-xbar)*(y[1]-ybar)
...
(x[49]-xbar)*(y[49]-ybar)

We can zip() the \(x\) and \(y\) lists together. This essentially just pairs the 0th elements, the 1st elements, etc. Then we can iterate over the values together.

num = 0
for i, j in zip(x, y): #i corresponds to the x elements and j the y elements
    num += (i-xbar)*(j-ybar)
num
-51.69981003655184
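
To see what zip() produces, we can run list() on it, just as we did with range(). A small sketch with made-up lists follows. Note that zip() silently stops at the shorter input, so lists of unequal length would not raise an error in a loop like the one above.

```python
letters = ["a", "b", "c"]
numbers = [1, 2, 3, 4] #one element longer on purpose

pairs = list(zip(letters, numbers))
print(pairs) #zip() stops at the end of the shorter list
```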

Ok, now we are ready to put these together and calculate our correlation!

def find_corr(x, y):
    """
    Compute Pearson's Correlation Coefficient
    x and y are assumed to be lists with numeric values
    Data is assumed to have no missing values
    """
    xbar = find_mean(x)
    ybar = find_mean(y)
    num = 0
    den_x = 0
    den_y = 0
    for i, j in zip(x, y):
        num +=(i-xbar)*(j-ybar)
        den_x +=(i-xbar)**2
        den_y +=(j-ybar)**2
    return num/sqrt(den_x*den_y)

Let’s test our function on our data!

find_corr(x, y) #near 0!
-0.0813179110596017
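
A useful sanity check for a function like this is to try it on data where we know the answer: points that fall exactly on an increasing line should give a correlation of 1 and points on a decreasing line should give -1. The sketch below repeats the function (computing the means inline so the block runs on its own) and uses made-up data.

```python
from math import sqrt, isclose

def find_corr(x, y):
    """
    Compute Pearson's Correlation Coefficient (same logic as above)
    """
    xbar = sum(x)/len(x)
    ybar = sum(y)/len(y)
    num = 0
    den_x = 0
    den_y = 0
    for i, j in zip(x, y):
        num += (i-xbar)*(j-ybar)
        den_x += (i-xbar)**2
        den_y += (j-ybar)**2
    return num/sqrt(den_x*den_y)

x_small = [1, 2, 3, 4, 5]
up = [2*i + 1 for i in x_small]    #exact increasing line
down = [10 - 3*i for i in x_small] #exact decreasing line
print(isclose(find_corr(x_small, up), 1))
print(isclose(find_corr(x_small, down), -1))
```

isclose() is used rather than == because floating point arithmetic can leave a tiny rounding error.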

Note that every function has a .__doc__ attribute holding its doc string (it is None when no doc string was written). You can look at it to understand that function (assuming the doc string is useful!).

print(find_corr.__doc__)

    Compute Pearson's Correlation Coefficient
    x and y are assumed to be lists with numeric values
    Data is assumed to have no missing values
    
print(len.__doc__) #another example on a common function
Return the number of items in a container.
print(np.random.default_rng.__doc__) #another example
default_rng(seed=None)
Construct a new Generator with the default BitGenerator (PCG64).

    Parameters
    ----------
    seed : {None, int, array_like[ints], SeedSequence, BitGenerator, Generator}, optional
        A seed to initialize the `BitGenerator`. If None, then fresh,
        unpredictable entropy will be pulled from the OS. If an ``int`` or
        ``array_like[ints]`` is passed, then it will be passed to
        `SeedSequence` to derive the initial `BitGenerator` state. One may also
        pass in a `SeedSequence` instance.
        Additionally, when passed a `BitGenerator`, it will be wrapped by
        `Generator`. If passed a `Generator`, it will be returned unaltered.

    Returns
    -------
    Generator
        The initialized generator object.

    Notes
    -----
    If ``seed`` is not a `BitGenerator` or a `Generator`, a new `BitGenerator`
    is instantiated. This function does not manage a default global instance.

    See :ref:`seeding_and_entropy` for more information about seeding.
    
    Examples
    --------
    ``default_rng`` is the recommended constructor for the random number class
    ``Generator``. Here are several ways we can construct a random 
    number generator using ``default_rng`` and the ``Generator`` class. 
    
    Here we use ``default_rng`` to generate a random float:
 
    >>> import numpy as np
    >>> rng = np.random.default_rng(12345)
    >>> print(rng)
    Generator(PCG64)
    >>> rfloat = rng.random()
    >>> rfloat
    0.22733602246716966
    >>> type(rfloat)
    <class 'float'>
     
    Here we use ``default_rng`` to generate 3 random integers between 0 
    (inclusive) and 10 (exclusive):
        
    >>> import numpy as np
    >>> rng = np.random.default_rng(12345)
    >>> rints = rng.integers(low=0, high=10, size=3)
    >>> rints
    array([6, 2, 7])
    >>> type(rints[0])
    <class 'numpy.int64'>
    
    Here we specify a seed so that we have reproducible results:
    
    >>> import numpy as np
    >>> rng = np.random.default_rng(seed=42)
    >>> print(rng)
    Generator(PCG64)
    >>> arr1 = rng.random((3, 3))
    >>> arr1
    array([[0.77395605, 0.43887844, 0.85859792],
           [0.69736803, 0.09417735, 0.97562235],
           [0.7611397 , 0.78606431, 0.12811363]])

    If we exit and restart our Python interpreter, we'll see that we
    generate the same random numbers again:

    >>> import numpy as np
    >>> rng = np.random.default_rng(seed=42)
    >>> arr2 = rng.random((3, 3))
    >>> arr2
    array([[0.77395605, 0.43887844, 0.85859792],
           [0.69736803, 0.09417735, 0.97562235],
           [0.7611397 , 0.78606431, 0.12811363]])
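
Two related tools: the built-in help() prints the doc string along with the function signature, and inspect.getdoc() from the standard library returns the doc string with surrounding blank lines and indentation cleaned up. A short sketch (the function names with_doc and without_doc are made up for illustration):

```python
import inspect

def with_doc(x):
    """
    Return x unchanged
    """
    return x

def without_doc(x):
    return x

print(repr(with_doc.__doc__))         #raw doc string as stored on the function
print(repr(inspect.getdoc(with_doc))) #cleaned up version
print(without_doc.__doc__)            #None when no doc string was written
```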

    

Attributes are another important thing we’ll learn about, especially when we get into pyspark. We now have

  • functions() which go prior to the object
  • .methods() that go on the end of the object

and

  • .attributes that also go on the end of an object just with no ().
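
A quick side-by-side sketch of the three, using a made-up list and a complex number (chosen only because it has a simple built-in attribute):

```python
values = [3, 1, 2]
z = 3 + 4j

print(len(values)) #function: goes before the object
values.append(10)  #method: goes on the end of the object, with ()
print(values)
print(z.real)      #attribute: goes on the end of the object, no ()
```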

Other Things to Note

  • When executing a function, a new symbol table is used for the local variables
  • This keeps us from accidentally overwriting something
import numpy as np
y = np.array(range(1,11))

def square(z):
    y = z**2
    print("In the function environment, z = " + str(z) + " and y = " + str(y))
    return(y)

print(square(y))
print(y) #y is not changed
In the function environment, z = [ 1  2  3  4  5  6  7  8  9 10] and y = [  1   4   9  16  25  36  49  64  81 100]
[  1   4   9  16  25  36  49  64  81 100]
[ 1  2  3  4  5  6  7  8  9 10]
print(z) #z isn't defined outside the function call! error
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-36-7d7ac3dfdf36> in <cell line: 1>()
----> 1 print(z) #z isn't defined outside the function call! error

NameError: name 'z' is not defined
  • You can define global variables from within a function using global
def square(z):
    global y
    y = z**2
    print("In the function environment, z = " + str(z) + " and y = " + str(y))
    return(y)

print(square(y))
print(y) #modified globally now
In the function environment, z = [ 1  2  3  4  5  6  7  8  9 10] and y = [  1   4   9  16  25  36  49  64  81 100]
[  1   4   9  16  25  36  49  64  81 100]
[  1   4   9  16  25  36  49  64  81 100]
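
A common alternative to global, sketched below, is to return the new value and reassign it where the function is called. This keeps the change visible at the place it happens. Plain lists are used in place of numpy arrays so the block runs without extra packages, and the function name square_all is made up for illustration.

```python
def square_all(z):
    """
    Return a new list with each element of z squared
    """
    return [value**2 for value in z]

y = list(range(1, 11))
squared = square_all(y)
print(y)    #unchanged by the function call
y = squared #the update happens explicitly, outside the function
print(y)
```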
  • If nothing is returned from a function (with return) then it actually returns the special None
def square_it(a):
    if (type(a) == int) or (type(a) == float):
      return a**2
    else:
      return

print(square_it(10))
print(square_it(10.5))
print(square_it("10"))
100
110.25
None
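
The same thing happens when a function has no return statement at all. This is also why the .sort() method mentioned earlier gives back None: it changes the list in place rather than returning a new one. A short sketch (the function name print_square is made up):

```python
def print_square(a):
    print(a**2) #prints, but there is no return statement

result = print_square(4)
print(result)

values = [3, 1, 2]
print(values.sort()) #.sort() works in place and returns None
print(values)
```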
  • Default values are only evaluated once - at the time of the function definition

  • Mutable objects can cause an issue! (Lists are mutable as they can be changed, some objects, like tuples, are immutable and can’t be modified.)

#append a value to a list but give a default empty list if not given
def my_append(value, L = []):
    L.append(value)
    return L

#correctly appends "A" to the list
print(my_append("A"))
#appends "B" to the previous list as L = [] was only evaluated at the time the function was created!
print(my_append("B"))
['A']
['A', 'B']
  • To avoid this behavior, instead define the default value as None and take care of things within the function body
def my_append(value, L = None):
    if L is None:
        L = []
    L.append(value)
    return L

print(my_append("A"))
print(my_append("B"))
['A']
['B']
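
To see why the version with L = [] misbehaves, we can look at where python stores default values: the function’s __defaults__ attribute. A small sketch using that first version of my_append():

```python
def my_append(value, L = []):
    L.append(value)
    return L

print(my_append.__defaults__) #the single default list, created at definition time
my_append("A")
my_append("B")
print(my_append.__defaults__) #the same list object has been modified
```

The default list lives on the function object itself, so every call that relies on the default shares it.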

Video Demo

This quick video demo gives another example of creating our own function! Remember to pop the video out into the full player.

The notebook written in the video is available here.

from IPython.display import IFrame
IFrame(src="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=ae1858b3-74cf-4065-8ec7-b0f800e4f827&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720")

Recap

  • Writing functions is super cool!
def func_name(args):
    """
    Doc string
    """
    body
    return object
  • Many ways to set up your function arguments and to call your function

  • Even more on function writing will be covered later!

If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!

If you are on Google Colab, head back to our course website for our next lesson!