DEV Community

Paul Apivat

Posted on • Edited on • Originally published at paulapivat.com

Vectors in Python

Overview

We'll see the from-scratch aspect of the book* play out as we implement several building-block functions that help us work towards defining the Euclidean Distance in code.

  • note: This is chapter 4, Linear Algebra, of Data Science from Scratch by Joel Grus.

While we don't see its application immediately, we can expect to see the Euclidean Distance used for K-nearest neighbors (classification) or K-means (clustering) to find the "k closest points" (source). (Note: there are other types of distance formulas used as well.)

En route towards implementing the Euclidean Distance, we also implement the sum of squares which is a crucial piece for how regression works.

Thus, the from scratch aspect of this book works on two levels. Within this chapter, we're building piece by piece up to an important distance and sum of squares formula. But we're also building tools we'll use in subsequent chapters.

Vectors

We start off by implementing functions to add and subtract two vectors. We also create a function for the componentwise sum of a list of vectors, where a new vector is created whose first element is the sum of all the first elements in the list, and so on.

We then create a function to multiply a vector by a scalar, which we use to compute the componentwise mean of a list of vectors.

We also create the dot product of two vectors, which is the sum of their componentwise products; it is the generalized version of the sum of squares. At this point, we have enough to implement the Euclidean distance. Let's take a look at the code:

Example Vectors

A vector is simply a list of numbers:

height_weight_age = [70,170,40]

grades = [95,80,75,62]

Add

You'll note that we use type annotations throughout our code. This is a convention advocated by the author (and as a newcomer to Python, I like the idea of being explicit about the data types of a function's input and output).

from typing import List

Vector = List[float]

def add(v: Vector, w: Vector) -> Vector:
    """Adds corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v,w)]

assert add([1,2,3], [4,5,6]) == [5,7,9]

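Here's another way to look at the add function: zip pairs up the elements of v and w that share the same position, and each pair is added, so the result has the same length as the inputs.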
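
To make the pairing step concrete, here is a small sketch of what zip produces inside add and how the length check behaves. It repeats the definition of add so that it runs on its own; the mismatched input is an assumption chosen only for illustration.

```python
from typing import List

Vector = List[float]

def add(v: Vector, w: Vector) -> Vector:
    """Adds corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

# zip pairs up elements that share the same position
pairs = list(zip([1, 2, 3], [4, 5, 6]))
assert pairs == [(1, 4), (2, 5), (3, 6)]

# vectors of different lengths trip the assertion inside add
try:
    add([1, 2, 3], [4, 5])
    length_check_failed = False
except AssertionError:
    length_check_failed = True

assert length_check_failed
```

Without the assertion, zip would silently stop at the end of the shorter vector, which is a good reason to keep the check.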

Subtract

def subtract(v: Vector, w: Vector) -> Vector:
    """Subtracts corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i - w_i for v_i, w_i in zip(v,w)]

assert subtract([5,7,9], [4,5,6]) == [1,2,3]

This is pretty much the same as add, except that each pair of elements is subtracted instead of added.

Componentwise Sum

def vector_sum(vectors: List[Vector]) -> Vector:
    """Sum all corresponding elements (componentwise sum)"""
    # Check that vectors is not empty
    assert vectors, "no vectors provided!"
    # Check the vectors are all the same size
    num_elements = len(vectors[0])
    assert all(len(v) == num_elements for v in vectors), "different sizes!"
    # the i-th element of the result is the sum of every vector[i]
    return [sum(vector[i] for vector in vectors)
            for i in range(num_elements)]

assert vector_sum([[1,2], [3,4], [5,6], [7,8]]) == [16,20]

Here, a list of vectors becomes one vector. The add function takes exactly two vectors, so if we tried to give it four vectors, we'd get a TypeError. Instead, we wrap the four vectors in a list and provide that as the argument for vector_sum.
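
As a possible alternative, the same componentwise sum can be sketched with zip(*vectors), which groups the i-th elements of every vector together. This is a variation for illustration, not the book's version, and the name vector_sum_zip is hypothetical.

```python
from typing import List

Vector = List[float]

def vector_sum_zip(vectors: List[Vector]) -> Vector:
    """Componentwise sum written with zip(*vectors) (hypothetical variant)"""
    assert vectors, "no vectors provided!"
    num_elements = len(vectors[0])
    assert all(len(v) == num_elements for v in vectors), "different sizes!"
    # zip(*vectors) yields one tuple per position: (1, 3, 5, 7), then (2, 4, 6, 8)
    return [sum(components) for components in zip(*vectors)]

assert vector_sum_zip([[1, 2], [3, 4], [5, 6], [7, 8]]) == [16, 20]
```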

Multiply Vector with a Number

def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiplies every element by c"""
    return [c * v_i for v_i in v]

assert scalar_multiply(2, [2,4,6]) == [4,8,12]

One number is multiplied with every number in the vector, and the vector retains its length.
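
One way to see how these pieces fit together: subtracting w gives the same result as adding w scaled by -1. The sketch below checks this on the example values used earlier; it repeats the definitions so that it runs on its own.

```python
from typing import List

Vector = List[float]

def add(v: Vector, w: Vector) -> Vector:
    """Adds corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def subtract(v: Vector, w: Vector) -> Vector:
    """Subtracts corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiplies every element by c"""
    return [c * v_i for v_i in v]

v, w = [5, 7, 9], [4, 5, 6]
negated_w = scalar_multiply(-1, w)   # flips the sign of every element
assert negated_w == [-4, -5, -6]
assert add(v, negated_w) == subtract(v, w) == [1, 2, 3]
```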

Componentwise Mean

This is similar to componentwise sum (see above); a list of vectors becomes one vector.

def vector_mean(vectors: List[Vector]) -> Vector: 
    """Computes the element-wise average"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))

assert vector_mean([ [1,2], [3,4], [5,6] ]) == [3,4]
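
Because vector_mean multiplies by 1/n, its results are floats, and an exact == comparison can be fragile for other inputs. A more defensive check, sketched here with math.isclose from the standard library, compares each component within a tolerance. The helper name vectors_close is hypothetical.

```python
import math
from typing import List

Vector = List[float]

def vector_sum(vectors: List[Vector]) -> Vector:
    """Sum all corresponding elements (componentwise sum)"""
    assert vectors, "no vectors provided!"
    num_elements = len(vectors[0])
    assert all(len(v) == num_elements for v in vectors), "different sizes!"
    return [sum(vector[i] for vector in vectors) for i in range(num_elements)]

def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiplies every element by c"""
    return [c * v_i for v_i in v]

def vector_mean(vectors: List[Vector]) -> Vector:
    """Computes the element-wise average"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))

def vectors_close(v: Vector, w: Vector) -> bool:
    """True if v and w have equal length and nearly equal components (hypothetical helper)"""
    return len(v) == len(w) and all(math.isclose(v_i, w_i) for v_i, w_i in zip(v, w))

mean = vector_mean([[1, 2], [3, 4], [5, 6]])
assert all(isinstance(m_i, float) for m_i in mean)  # floats, even for integer inputs
assert vectors_close(mean, [3, 4])
```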

Dot Product

def dot(v: Vector, w: Vector) -> float:
    """Computes v_1 * w_1 + ... + v_n * w_n"""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v,w))

assert dot([1,2,3], [4,5,6]) == 32

Here we multiply the corresponding elements, then sum the results. Two vectors become a single number (float).
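
A few small cases may help build intuition for the dot product. The example vectors below are assumptions chosen for illustration; the definition of dot is repeated so the sketch runs on its own.

```python
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    """Computes v_1 * w_1 + ... + v_n * w_n"""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

assert dot([1, 2, 3], [0, 1, 0]) == 2   # a single 1 picks out the second component
assert dot([1, 2, 3], [1, 1, 1]) == 6   # all ones simply sums the elements
assert dot([1, 0], [0, 1]) == 0         # perpendicular vectors give 0
```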

Sum of Squares

def sum_of_squares(v: Vector) -> float:
    """Returns v_1 * v_1 + ... + v_n * v_n"""
    return dot(v,v)

assert sum_of_squares([1,2,3]) == 14

In fact, sum_of_squares is a special case of the dot product: it is the dot product of a vector with itself.
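
To check that claim, the sketch below compares sum_of_squares with a direct sum of squared elements. The variable name direct is just for illustration.

```python
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    """Computes v_1 * w_1 + ... + v_n * w_n"""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v: Vector) -> float:
    """Returns v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)

v = [1, 2, 3]
direct = sum(v_i ** 2 for v_i in v)   # 1 + 4 + 9
assert sum_of_squares(v) == dot(v, v) == direct == 14
```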

Magnitude

import math

def magnitude(v: Vector) -> float:
    """Returns the magnitude (or length) of v"""
    return math.sqrt(sum_of_squares(v)) # math.sqrt is the square root function

assert magnitude([3,4]) == 5

With magnitude, we take the square root of sum_of_squares. This is none other than the Pythagorean theorem.
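
The link to the Pythagorean theorem can be sketched with the classic 3-4-5 right triangle. The cross-check against math.hypot from the standard library and the three-dimensional example are additions for illustration.

```python
import math
from typing import List

Vector = List[float]

def sum_of_squares(v: Vector) -> float:
    """Returns v_1 * v_1 + ... + v_n * v_n"""
    return sum(v_i * v_i for v_i in v)

def magnitude(v: Vector) -> float:
    """Returns the magnitude (or length) of v"""
    return math.sqrt(sum_of_squares(v))

# a right triangle with legs 3 and 4 has a hypotenuse of 5
assert magnitude([3, 4]) == 5

# math.hypot computes the same quantity from the two legs
assert math.isclose(magnitude([3, 4]), math.hypot(3, 4))

# the same formula extends beyond two dimensions: sqrt(1 + 4 + 4) = 3
assert magnitude([1, 2, 2]) == 3
```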

Squared Distance

def squared_distance(v: Vector, w: Vector) -> float:
    """Computes (v_1 - w_1) ** 2 + ... + (v_n - w_n) ** 2"""
    return sum_of_squares(subtract(v,w))

This is the distance between two vectors, squared.
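
This section has no assertion of its own, so here is a small worked example. The vectors are assumptions chosen so the arithmetic is easy to follow, and the definitions are repeated so the sketch runs on its own.

```python
from typing import List

Vector = List[float]

def subtract(v: Vector, w: Vector) -> Vector:
    """Subtracts corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def sum_of_squares(v: Vector) -> float:
    """Returns v_1 * v_1 + ... + v_n * v_n"""
    return sum(v_i * v_i for v_i in v)

def squared_distance(v: Vector, w: Vector) -> float:
    """Computes (v_1 - w_1) ** 2 + ... + (v_n - w_n) ** 2"""
    return sum_of_squares(subtract(v, w))

v, w = [1, 2], [4, 6]
# the differences are (1 - 4) and (2 - 6), i.e. -3 and -4
assert subtract(v, w) == [-3, -4]
# (-3) ** 2 + (-4) ** 2 = 9 + 16
assert squared_distance(v, w) == 25
# swapping the arguments changes nothing, because the differences are squared
assert squared_distance(w, v) == 25
```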


(Euclidean) Distance

import math

def distance(v: Vector, w: Vector) -> float:
    """Computes the (Euclidean) distance between v and w"""
    return math.sqrt(squared_distance(v,w))

Finally, we take the square root of squared_distance to get the (Euclidean) distance.
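
The same distance can also be written as the magnitude of the difference vector, and Python 3.8+ ships math.dist, which can serve as a cross-check. The sketch below compares all three; distance_via_magnitude is a hypothetical name, and the example vectors are assumptions.

```python
import math
from typing import List

Vector = List[float]

def subtract(v: Vector, w: Vector) -> Vector:
    """Subtracts corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def sum_of_squares(v: Vector) -> float:
    """Returns v_1 * v_1 + ... + v_n * v_n"""
    return sum(v_i * v_i for v_i in v)

def magnitude(v: Vector) -> float:
    """Returns the magnitude (or length) of v"""
    return math.sqrt(sum_of_squares(v))

def distance(v: Vector, w: Vector) -> float:
    """Square root of the squared distance"""
    return math.sqrt(sum_of_squares(subtract(v, w)))

def distance_via_magnitude(v: Vector, w: Vector) -> float:
    """Magnitude of the difference vector (hypothetical alternative)"""
    return magnitude(subtract(v, w))

v, w = [1, 2], [4, 6]
assert distance(v, w) == 5                             # sqrt(9 + 16)
assert distance_via_magnitude(v, w) == 5
assert math.isclose(distance(v, w), math.dist(v, w))   # math.dist needs Python 3.8+
```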

Summary

We literally built from scratch, albeit with some help from Python's math module, the blocks needed for essential functions that we'll expect to use later, namely: the sum_of_squares and distance.

It's pretty cool to see these foundational concepts set us up to understand more complex machine learning algorithms like regression, k-nearest neighbors (classification) and k-means (clustering), and even touch on the Pythagorean theorem.
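
As a preview of how distance might be used to find the "k closest points", here is a minimal sketch. The function k_closest and the sample points are hypothetical, and this is not the book's k-nearest neighbors implementation.

```python
import math
from typing import List

Vector = List[float]

def distance(v: Vector, w: Vector) -> float:
    """Euclidean distance between v and w"""
    assert len(v) == len(w), "vectors must be the same length"
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

def k_closest(points: List[Vector], target: Vector, k: int) -> List[Vector]:
    """Returns the k points nearest to target, closest first (hypothetical helper)"""
    return sorted(points, key=lambda p: distance(p, target))[:k]

# made-up sample points, chosen only for illustration
points = [[0, 0], [5, 5], [1, 1], [10, 10]]
nearest = k_closest(points, [2, 2], 2)
assert nearest == [[1, 1], [0, 0]]
```

Sorting every point is fine for a handful of vectors; a real implementation would likely want something more efficient for large datasets.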

We'll examine matrices next.


For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

Top comments (2)

Valts Liepiņš • Edited

Doing things from scratch is great for learning, but I think it's worth showcasing how the same can be done via numpy, which would be a better choice for real-world applications:

import numpy as np

# Vector in numpy
height_weight_age = np.asarray([70, 170, 40])
grades = np.asarray([85, 80, 75, 62])

# Vector addition / subtraction
print( np.asarray([1, 2, 3]) + np.asarray([4, 5, 6]) ) # => array([5, 7, 9])
print( np.asarray([5, 7, 9]) - np.asarray([4, 5, 6]) ) # => array([1, 2, 3])

# Componentwise sum
# Without `axis=0`, it would sum all matrix components
print( np.asarray([[1,2], [3,4], [5,6], [7,8]]).sum(axis=0) ) # => array([16, 20])

# Multiplying vector with a number
print( 2 * np.asarray([2, 4, 6]) ) # => array([4, 8, 12])

# Componentwise mean
print( np.asarray([[1,2], [3,4], [5,6]]).mean(axis=0) ) # => array([3., 4.])

# Dot product
print( np.asarray([1, 2, 3]).dot(np.asarray([4, 5, 6])) ) # => 32

# Sum of squares
# Equivalent to dot with self
print( np.asarray([1, 2, 3]).dot(np.asarray([1, 2, 3])) ) # => 14

# Magnitude
print( np.linalg.norm(np.asarray([3, 4])) ) # => 5.0

# Squared difference
def squared_difference(v, w):
  return (v - w).dot(v - w)

# Euclidean distance
def distance(v, w):
  return np.linalg.norm(v - w)

Even if some of the cases might look more confusing, the greatest advantage of numpy is that it's optimized for manipulating large, multidimensional data arrays, which is perfect for data analytics use.

Paul Apivat

Absolutely agree. Thank you for sharing this!