Neural Network Basics: Linear Regression with PyTorch

In just a few short years, PyTorch took the crown as the most popular deep learning framework. Its concise and straightforward API makes it easy to customize popular networks and layers. While some of the terminology may seem foreign to mathematicians, the concepts are familiar to anyone with a little experience in machine learning. This post walks the reader from a simple linear regression to an (overkill) neural network model with many more parameters, which provides a good base for future learning.

Configuration

Set up the environment with the basic libraries and the necessary data.

import numpy as np
import torch
torch.set_printoptions(edgeitems=2)
c = [0.5,  14.0, 15.0, 28.0, 11.0,  8.0,  3.0, -4.0,  6.0, 13.0, 21.0]
u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]

Progressive Estimation

Estimation: QR Decomposition

R is the domain-specific language for statistics, and we will use R’s well-known lm() function to produce reference estimates for later comparison. Internally, lm() solves the least-squares problem for the parameters via QR decomposition.
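For comparison while staying in Python, here is a minimal sketch of the same least-squares fit via QR decomposition in numpy, assuming the c and u lists defined above (the variable names are just for illustration); the coefficients should match lm() up to rounding.

import numpy as np

c_arr = np.array(c)
u_arr = np.array(u)
X = np.column_stack([np.ones_like(u_arr), u_arr])   #design matrix: intercept column plus u
Q, R = np.linalg.qr(X)                              #X = QR (reduced form)
beta = np.linalg.solve(R, Q.T @ c_arr)              #solve R @ beta = Q.T @ c
print(beta)                                         #approximately [-17.30, 0.537]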

import pandas as pd
df = pd.DataFrame({'c': c, 'u': u })
%load_ext rpy2.ipython
%%R -i df -w 400 -h 300

f <- lm(c ~ u, df)
print( summary(f)$coeff )
print( sprintf("MSE=%0.2f", mean(f$residuals^2) ))
plot(c ~ u, df)
               Estimate Std. Error  t value     Pr(>|t|)
(Intercept) -17.3047855  1.9272273 -8.97911 8.701879e-06
u             0.5367719  0.0355386 15.10391 1.062435e-07
[1] "MSE=2.93"

(figure: scatter plot of c versus u)

Prepare tensors

Let’s create PyTorch tensors from our data, along with basic implementations of the model and loss functions.

t_c = torch.tensor(c)
t_u = torch.tensor(u)
def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()

#initialize parameters
w = torch.ones(())
b = torch.zeros(())

t_p = model(t_u, w, b)
t_p

loss = loss_fn(t_p, t_c)
loss
tensor(1763.8846)

Naive GD algorithm

The naive gradient descent algorithm illustrates the basic idea of updating parameter estimates over a loss surface, but it is too crude to use as a practical solution.

delta = 0.1
learning_rate = 1e-2

loss_rate_of_change_w = \
    (loss_fn(model(t_u, w + delta, b), t_c) - 
     loss_fn(model(t_u, w - delta, b), t_c)) / (2.0 * delta)
w = w - learning_rate * loss_rate_of_change_w


loss_rate_of_change_b = \
    (loss_fn(model(t_u, w, b + delta), t_c) - 
     loss_fn(model(t_u, w, b - delta), t_c)) / (2.0 * delta)
b = b - learning_rate * loss_rate_of_change_b

print(w, b)
tensor(-44.1730) tensor(46.0250)

Analytical GD method

We can create a gradient function analytically by taking derivatives (chain rule) with respect to the parameters. A longer derivation can be found in ‘The Elements of Statistical Learning’, but the gist is that updates can be done in two passes:

  • forward: fixed weights are used to compute predicted values
  • backward:
    • errors (delta) are computed
    • errors (s) are ‘back-propagated’ using the back-propagation equation
    • both the delta and s errors are used to compute the gradients for the updates in gradient descent
def dloss_fn(t_p, t_c):
    dsq_diffs = 2 * (t_p - t_c) / t_p.size(0)  #derivative of the mean squared error w.r.t. t_p
    return dsq_diffs

def dmodel_dw(t_u, w, b):
    return t_u

def dmodel_db(t_u, w, b):
    return 1.0

def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = dloss_fn(t_p, t_c)
    dloss_dw = dloss_dtp * dmodel_dw(t_u, w, b)
    dloss_db = dloss_dtp * dmodel_db(t_u, w, b)
    return torch.stack([dloss_dw.sum(), dloss_db.sum()])  #gradient vector: [dloss/dw, dloss/db]
def training_loop(n_epochs, learning_rate, params, t_u, t_c, print_params=True):
    for epoch in range(1, n_epochs + 1):
        w, b = params

        t_p = model(t_u, w, b)                 #Forward Pass (prediction)
        loss = loss_fn(t_p, t_c)
        grad = grad_fn(t_u, t_c, t_p, w, b)    #Backward Pass (compute gradient)

        params = params - learning_rate * grad

        if epoch in {1, 2, 3, 10, 11, 99, 100, 4000, 5000}:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
            if print_params:
                print('    Params:', params)
                print('    Grad:  ', grad)
        if epoch in {4, 12, 101}:
            print('...')

        if not torch.isfinite(loss).all():     #check for divergence (updates are too large)
            break 
            
    return params

Fix divergence with different approaches, including:

  • ensuring the parameters have similar scale (normalize the input columns)
  • an adaptive learning rate (see the sketch after this list)
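As a minimal, hedged sketch of the second idea (not the path the rest of this post takes), one could shrink the learning rate whenever the loss increases between epochs; the halving factor is an arbitrary illustrative choice.

lr = 1e-2
prev_loss = float('inf')
params = torch.tensor([1.0, 0.0])

for epoch in range(100):
    t_p = model(t_u, *params)
    loss = loss_fn(t_p, t_c)
    if loss > prev_loss:                       #the last step overshot: take smaller steps
        lr *= 0.5
    grad = grad_fn(t_u, t_c, t_p, *params)
    params = params - lr * grad
    prev_loss = loss

The code that follows takes the first approach instead, rescaling the input so that both parameters see similarly sized gradients.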
t_un = 0.1 * t_u
params = training_loop(
    n_epochs = 5000, 
    learning_rate = 1e-2, 
    params = torch.tensor([1.0, 0.0]), 
    t_u = t_un, 
    t_c = t_c,
    print_params = False)

params
Epoch 1, Loss 80.364342
Epoch 2, Loss 37.574917
Epoch 3, Loss 30.871077
...
Epoch 10, Loss 29.030487
Epoch 11, Loss 28.941875
...
Epoch 99, Loss 22.214186
Epoch 100, Loss 22.148710
...
Epoch 4000, Loss 2.927680
Epoch 5000, Loss 2.927648
tensor([  5.3671, -17.3012])

PyTorch Components

The PyTorch API is well designed, but it incorporates many assumptions into its functionality. Be sure you know these basics thoroughly.

Using autograd

Back-propagation: we computed the gradient of a composition of functions - the model and the loss - with respect to their inner-most parameters - w and b - by propagating derivatives backwards using the chain rule.

Given a forward expression, no matter how nested, PyTorch will provide the gradient of that expression with respect to its input parameters automatically. This is because PyTorch tensors can remember where they come from, in terms of the operations and parent tensors that originated them, and they can provide the chain of derivatives of such operations with respect to their inputs automatically.
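As a quick, hedged illustration of this bookkeeping (the tensors a and b exist only for this demonstration), every tensor produced by an operation on a tracked tensor carries a grad_fn pointing back at that operation.

a = torch.tensor(2.0, requires_grad=True)
b = a * 3 + 1                      #b remembers it was produced by a multiply and an add
print(b.grad_fn)                   #e.g. <AddBackward0 object at ...>
print(b.grad_fn.next_functions)    #links further back along the chain toward a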

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()

params = torch.tensor([1.0, 0.0], requires_grad=True)
params.grad is None
True

That argument requires_grad=True is telling PyTorch to track the entire family tree of tensors resulting from operations on params. In other words, any tensor that will have params as an ancestor will have access to the chain of functions that were called to get from params to that tensor. In case these functions are differentiable (and most PyTorch tensor operations will be), the value of the derivative will be automatically populated as a grad attribute of the params tensor.

All we have to do to populate it is to start with a tensor with requires_grad set to True, then call the model (predict new values), compute the loss, and then call backward on the loss tensor. The grad attribute of params contains the derivatives of the loss with respect to each element of params.

loss = loss_fn(model(t_u, *params), t_c)
loss.backward()
params.grad
tensor([4517.2969,   82.6000])

WARNING: Calling backward causes derivatives to accumulate (be summed) at leaf nodes. We need to zero the gradients explicitly after using them for parameter updates.
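A minimal sketch of what accumulation means here, continuing from the state above: calling backward a second time without zeroing adds the new gradient to the one already stored.

loss = loss_fn(model(t_u, *params), t_c)
loss.backward()    #second backward call on the same leaf tensor
params.grad        #roughly twice the values printed above, not the same values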

if params.grad is not None:
    params.grad.zero_()
params.grad
tensor([0., 0.])

The update logic inside the with torch.no_grad() block below will later be handled by an ‘optimizer’.

def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        if params.grad is not None:     #call prior to loss.backward()
            params.grad.zero_()
        
        t_p = model(t_u, *params) 
        loss = loss_fn(t_p, t_c)
        loss.backward()
        
        with torch.no_grad():          #autograd mechanism should not add edges to the forward graph 
            params -= learning_rate * params.grad        #keep the same tensor params around, but subtract our update from it

        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
            
    return params
training_loop(
    n_epochs = 5000,
    learning_rate = 1e-2,
    params = torch.tensor([1.0, 0.0], requires_grad=True),
    t_u = t_un,
    t_c = t_c)
Epoch 500, Loss 7.860116
Epoch 1000, Loss 3.828538
Epoch 1500, Loss 3.092191
Epoch 2000, Loss 2.957697
Epoch 2500, Loss 2.933134
Epoch 3000, Loss 2.928648
Epoch 3500, Loss 2.927830
Epoch 4000, Loss 2.927679
Epoch 4500, Loss 2.927652
Epoch 5000, Loss 2.927647
tensor([  5.3671, -17.3012], requires_grad=True)

Selecting an optimizer

import torch.optim as optim

dir(optim)[:10]
['ASGD',
 'Adadelta',
 'Adagrad',
 'Adam',
 'AdamW',
 'Adamax',
 'LBFGS',
 'Optimizer',
 'RMSprop',
 'Rprop']

The optimizer is used with four basic steps:

  • the optimizer holds a reference to the parameters, and
  • after a loss is computed from the inputs,
  • a call to .backward() leads to .grad being populated on the parameters, then
  • the optimizer can access .grad and compute the parameter updates

The optimizer exposes two methods:

  • .zero_grad() - zeroes the grad attribute of all the parameters passed to the optimizer
  • .step() - updates the value of those parameters according to the specific optimization strategy
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-5
optimizer = optim.SGD([params], lr=learning_rate)
#complete step
t_p = model(t_un, *params)
loss = loss_fn(t_p, t_c)
optimizer.zero_grad()    #must zero out the gradients first
loss.backward()
optimizer.step()
params
tensor([1.0008e+00, 1.0640e-04], requires_grad=True)
def training_loop(n_epochs, optimizer, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
    return params
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)    #same params

training_loop(
    n_epochs = 5000,
    optimizer = optimizer,
    params = params,        #same params
    t_u = t_un,
    t_c = t_c)
Epoch 500, Loss 7.860118
Epoch 1000, Loss 3.828538
Epoch 1500, Loss 3.092191
Epoch 2000, Loss 2.957697
Epoch 2500, Loss 2.933134
Epoch 3000, Loss 2.928648
Epoch 3500, Loss 2.927830
Epoch 4000, Loss 2.927680
Epoch 4500, Loss 2.927651
Epoch 5000, Loss 2.927648
tensor([  5.3671, -17.3012], requires_grad=True)

Choosing activation functions

A neural network is essentially a composition of linear (affine) functions with nonlinear ‘activation’ functions wrapped around the nested terms. This small list of activation functions gives an idea of the most useful properties. While the sigmoid was originally the standard choice, Rectified Linear Units (ReLU) have since been shown to work better in practice.

import numpy as np
import matplotlib.pyplot as plt
import torch.nn as nn

input_t = torch.arange(-3, 3.1, 0.1)
activation_list = [
    nn.Tanh(),
    nn.Hardtanh(),
    nn.Sigmoid(),
    
    nn.Softplus(),
    nn.ReLU(),
    nn.LeakyReLU(negative_slope=0.1),

    #nn.Tanhshrink(),
    #nn.Softshrink(),
    #nn.Hardshrink(),
]
fig = plt.figure(figsize=(14, 28), dpi=600)

for i, activation_func in enumerate(activation_list):
    subplot = fig.add_subplot(len(activation_list), 3, i+1)
    subplot.set_title(type(activation_func).__name__)
    
    output_t = activation_func(input_t)
    
    plt.grid()
    plt.plot(input_t.numpy(), input_t.numpy(),'k', linewidth=1)
    plt.plot([-3,3], [0,0], 'k', linewidth=1)
    plt.plot([0,0], [-3,3], 'k', linewidth=1)
    plt.plot(input_t.numpy(), output_t.numpy(), 'r', linewidth=3)

output_t
tensor([-0.3000, -0.2900, -0.2800, -0.2700, -0.2600, -0.2500, -0.2400, -0.2300,
        -0.2200, -0.2100, -0.2000, -0.1900, -0.1800, -0.1700, -0.1600, -0.1500,
        -0.1400, -0.1300, -0.1200, -0.1100, -0.1000, -0.0900, -0.0800, -0.0700,
        -0.0600, -0.0500, -0.0400, -0.0300, -0.0200, -0.0100,  0.0000,  0.1000,
         0.2000,  0.3000,  0.4000,  0.5000,  0.6000,  0.7000,  0.8000,  0.9000,
         1.0000,  1.1000,  1.2000,  1.3000,  1.4000,  1.5000,  1.6000,  1.7000,
         1.8000,  1.9000,  2.0000,  2.1000,  2.2000,  2.3000,  2.4000,  2.5000,
         2.6000,  2.7000,  2.8000,  2.9000,  3.0000])

(figure: plots of the activation functions over the input range [-3, 3])

Preparing the training-validation split

The first line in the training loop evaluates model on train_t_u to produce train_t_p. Then train_loss is evaluated from train_t_p. This creates a computation graph that links train_t_u to train_t_p to train_loss. When model is evaluated again on val_t_u, it produces val_t_p and val_loss. In this case, a separate computation graph will be created that links val_t_u to val_t_p to val_loss. Separate tensors have been run through the same functions, model() and loss_fn(), generating separate computation graphs.

Since we’re never calling backward() on val_loss, why are we building the graph in the first place? We could in fact just call model() and loss_fn() as plain functions, without tracking history. However optimized, tracking history comes with additional costs that we could totally forego during the validation pass, especially when the model has millions of parameters. In order to address this, PyTorch allows us to switch off autograd when we don’t need it using the torch.no_grad context manager.

n_samples = t_u.shape[0]
n_val = int(0.2 * n_samples)

shuffled_indices = torch.randperm(n_samples)

train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]

train_indices, val_indices
(tensor([ 8,  5, 10,  0,  1,  6,  3,  2,  7]), tensor([4, 9]))
train_t_u = t_u[train_indices]
train_t_c = t_c[train_indices]

val_t_u = t_u[val_indices]
val_t_c = t_c[val_indices]

train_t_un = 0.1 * train_t_u
val_t_un = 0.1 * val_t_u
def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u, train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        train_t_p = model(train_t_u, *params)       #training data
        train_loss = loss_fn(train_t_p, train_t_c)
                             
        with torch.no_grad():                       #switch-off autograd when we don’t need it                         
            val_t_p = model(val_t_u, *params)       #validation data, separate computation graph will be created
            val_loss = loss_fn(val_t_p, val_t_c)
            assert val_loss.requires_grad == False  #ensure autograd is off
        
        optimizer.zero_grad()
        train_loss.backward()          #backward only called on train_loss: accumulated the derivatives on the leaf nodes
        optimizer.step()

        if epoch <= 3 or epoch % 500 == 0:
            print('Epoch {}, Training loss {}, Validation loss {}'.format(
                epoch, float(train_loss), float(val_loss)))
            
    return params
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)

training_loop(
    n_epochs = 3000, 
    optimizer = optimizer,
    params = params,
    train_t_u = train_t_un,        #training input (independent variable)
    val_t_u = val_t_un,            #validation input (independent variable)
    train_t_c = train_t_c,
    val_t_c = val_t_c)
Epoch 1, Training loss 93.96257781982422, Validation loss 19.172250747680664
Epoch 2, Training loss 30.066646575927734, Validation loss 43.27933120727539
Epoch 3, Training loss 24.354103088378906, Validation loss 53.700531005859375
Epoch 500, Training loss 11.108868598937988, Validation loss 12.03146743774414
Epoch 1000, Training loss 5.690493106842041, Validation loss 0.8163924217224121
Epoch 1500, Training loss 3.397932529449463, Validation loss 1.738821268081665
Epoch 2000, Training loss 2.4279325008392334, Validation loss 5.815708160400391
Epoch 2500, Training loss 2.017514944076538, Validation loss 9.938664436340332
Epoch 3000, Training loss 1.843862771987915, Validation loss 13.242980003356934
tensor([  5.8612, -20.4629], requires_grad=True)

Our main goal is to see both the training loss and the validation loss decreasing. While ideally both losses would be roughly the same value, as long as the validation loss stays reasonably close to the training loss, we know that our model is continuing to learn something generalizable about our data.
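One hedged, illustrative way to keep an eye on that relationship is to record both losses each epoch and plot them afterwards; the list names below are placeholders for whatever is collected inside the training loop.

import matplotlib.pyplot as plt

train_losses, val_losses = [], []
#inside training_loop, after both losses are computed, one could append:
#    train_losses.append(float(train_loss))
#    val_losses.append(float(val_loss))

plt.plot(train_losses, label='training loss')
plt.plot(val_losses, label='validation loss')
plt.xlabel('epoch')
plt.ylabel('MSE')
plt.legend()
plt.show()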

Working with Layers (nn.Module)

A PyTorch module is a Python class deriving from the nn.Module base class. A Module can have one or more Parameter instances as attributes, which are tensors whose values are optimized during the training process (think w and b in our linear model). A Module can also have one or more submodules (subclasses of nn.Module) as attributes, and it will be able to track their Parameters as well.

NOTE The submodules must be top-level attributes, not buried inside list or dict instances! Otherwise the optimizer will not be able to locate the submodules (and hence their parameters). For situations where your model requires a list or dict of submodules, PyTorch provides nn.ModuleList and nn.ModuleDict.
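A minimal sketch of that pitfall, with illustrative class names: parameters buried in a plain Python list are invisible to parameters(), while nn.ModuleList registers them.

import torch.nn as nn

class BuriedInList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(1, 4), nn.Linear(4, 1)]                 #plain list: not registered

class UsesModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(1, 4), nn.Linear(4, 1)])  #registered submodules

print(len(list(BuriedInList().parameters())))     #0 -- the optimizer would see nothing
print(len(list(UsesModuleList().parameters())))   #4 -- both weights and both biases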

All PyTorch-provided subclasses of nn.Module have their __call__ method defined. This allows one to instantiate an nn.Linear and call it as if it were a function, like so:

c = [0.5,  14.0, 15.0, 28.0, 11.0,  8.0,  3.0, -4.0,  6.0, 13.0, 21.0]
u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
t_c = torch.tensor(c).unsqueeze(1)     #add a dimension at axis 1: one feature per sample, batch along axis 0
t_u = torch.tensor(u).unsqueeze(1)     #same
t_c[:3]
tensor([[ 0.5000],
        [14.0000],
        [15.0000]])
import torch.nn as nn
linear_model = nn.Linear(1, 1)     #args: input size, output size, and bias defaulting to True.
linear_model(t_c[:3])
tensor([[-0.8724],
        [-4.6115],
        [-4.8885]], grad_fn=<AddmmBackward>)

Calling an instance of nn.Module with a set of arguments ends up calling a method named forward with the same arguments. The forward method is what executes the forward computation, while __call__ does other rather important chores before and after calling forward. So, it is technically possible to call forward directly and it will produce the same output as __call__, but it should not be done from user code:

y = model(x)              #correct
y = model.forward(x)      #don't do this
linear_model.weight, linear_model.bias
(Parameter containing:
 tensor([[-0.2770]], requires_grad=True), Parameter containing:
 tensor([-0.7339], requires_grad=True))

Any module in nn is written to produce outputs for a batch of multiple inputs at the same time. Modules expect the zeroth dimension of the input to be the number of samples in the batch.
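A quick, hedged illustration of that convention, using a dummy batch:

x = torch.ones(10, 1)              #a batch of 10 samples, each with 1 feature
print(nn.Linear(1, 1)(x).shape)    #torch.Size([10, 1]) -- one output row per sample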

def training_loop(n_epochs, optimizer, model, loss_fn, t_u_train, t_u_val, t_c_train, t_c_val):
    for epoch in range(1, n_epochs + 1):
        t_p_train = model(t_u_train)        #add model as an input
        loss_train = loss_fn(t_p_train, t_c_train)
        t_p_val = model(t_u_val)
        loss_val = loss_fn(t_p_val, t_c_val)
        optimizer.zero_grad()
        loss_train.backward()
        optimizer.step()
        if epoch == 1 or epoch % 1000 == 0:
            print('Epoch {}, Training loss {}, Validation loss {}'.format(
                epoch, float(loss_train), float(loss_val)))
linear_model = nn.Linear(1, 1)
optimizer = optim.SGD(
    linear_model.parameters(),    #replaced [params] with this method call 
    lr=1e-2)
training_loop(
    n_epochs = 3000,
    optimizer = optimizer,
    model = linear_model,
    loss_fn = nn.MSELoss(),
    t_u_train = train_t_un,
    t_u_val = val_t_un,
    t_c_train = train_t_c,
    t_c_val = val_t_c)

print()
print(linear_model.weight)
print(linear_model.bias)
Epoch 1, Training loss 184.62193298339844, Validation loss 131.338623046875
Epoch 1000, Training loss 2.996669054031372, Validation loss 6.098757266998291
Epoch 2000, Training loss 2.4377217292785645, Validation loss 6.508167743682861
Epoch 3000, Training loss 2.4287760257720947, Validation loss 6.561643123626709

Parameter containing:
tensor([[5.4813]], requires_grad=True)
Parameter containing:
tensor([-17.4250], requires_grad=True)

Create a Neural Network

Simple models

Let’s build the simplest possible neural network: a linear module, followed by an activation function, feeding into another linear module. The first linear + activation layer is commonly referred to as a hidden layer for historical reasons, since its outputs are not observed directly but fed into the output layer.

seq_model = nn.Sequential(
    nn.Linear(1, 13),
    nn.Tanh(),
    nn.Linear(13, 1))

After calling backward() on the loss, all parameters are populated with their grad, and the optimizer then updates their values accordingly during the optimizer.step() call.

for param in seq_model.parameters():
    print(param.shape)

print()
for name, param in seq_model.named_parameters():
    print(name, param.shape)
torch.Size([13, 1])
torch.Size([13])
torch.Size([1, 13])
torch.Size([1])

0.weight torch.Size([13, 1])
0.bias torch.Size([13])
2.weight torch.Size([1, 13])
2.bias torch.Size([1])
from collections import OrderedDict
namedseq_model = nn.Sequential(OrderedDict([
    ('hidden_linear', nn.Linear(1, 13)),
    ('hidden_activation', nn.Tanh()),
    ('output_linear', nn.Linear(13, 1))
]))
namedseq_model.output_linear.bias
Parameter containing:
tensor([-0.1054], requires_grad=True)
optimizer = optim.SGD(
    seq_model.parameters(), 
    lr=1e-3)

training_loop(
    n_epochs = 5000,
    optimizer = optimizer,
    model = seq_model,
    loss_fn = nn.MSELoss(),
    t_u_train = train_t_un,
    t_u_val = val_t_un,
    t_c_train = train_t_c,
    t_c_val = val_t_c)

print()
print('output', seq_model(val_t_un))
print('answer', val_t_c)
print('hidden', seq_model.hidden_linear.weight.grad)
Epoch 1, Training loss 1.2322325706481934, Validation loss 7.267159461975098
Epoch 1000, Training loss 1.2229273319244385, Validation loss 7.296848773956299
Epoch 2000, Training loss 1.2134442329406738, Validation loss 7.330402374267578
Epoch 3000, Training loss 1.2036887407302856, Validation loss 7.36702823638916
Epoch 4000, Training loss 1.193618655204773, Validation loss 7.405879974365234
Epoch 5000, Training loss 1.1832159757614136, Validation loss 7.446534156799316

output tensor([[13.1834],
        [16.1821]], grad_fn=<AddmmBackward>)
answer tensor([[11.],
        [13.]])
hidden tensor([[ 0.0005],
        [-0.0007],
        [ 0.0138],
        [ 0.0057],
        [-0.0029],
        [ 0.0027],
        [ 0.0025],
        [-0.0165],
        [ 0.0027],
        [-0.0030],
        [-0.0018],
        [ 0.0017],
        [-0.0152]])

Subclassing the nn.Module

In order to subclass nn.Module, at a minimum we need to define a .forward(…) function that takes the input to the module and returns the output. If we use standard torch operations, autograd will take care of the backward pass automatically. Often your entire model will be implemented as a subclass of nn.Module, which can, in turn, contain submodules that are also subclasses of nn.Module.

Assigning an instance of nn.Module to an attribute in an nn.Module, just as we do in the constructor below, automatically registers the module as a submodule. This allows a module to access the parameters of its submodules without further action by the user.

class SubclassModel(nn.Module):
    def __init__(self):
        super().__init__()        #calls nn.Module 's __init__ which sets up the housekeeping
        self.hidden_linear = nn.Linear(1, 13)
        self.hidden_activation = nn.Tanh()
        self.output_linear = nn.Linear(13, 1)
    def forward(self, input):
        hidden_t = self.hidden_linear(input)
        activated_t = self.hidden_activation(hidden_t)
        output_t = self.output_linear(activated_t)
        return output_t
    
subclass_model = SubclassModel()
subclass_model
SubclassModel(
  (hidden_linear): Linear(in_features=1, out_features=13, bias=True)
  (hidden_activation): Tanh()
  (output_linear): Linear(in_features=13, out_features=1, bias=True)
)
for type_str, model in [('seq', seq_model), ('namedseq', namedseq_model), ('subclass', subclass_model)]:
    print(type_str)
    for name_str, param in model.named_parameters():
        print("{:21} {:19} {}".format(name_str, str(param.shape), param.numel()))
    print()
seq
hidden_linear.weight  torch.Size([13, 1]) 13
hidden_linear.bias    torch.Size([13])    13
output_linear.weight  torch.Size([1, 13]) 13
output_linear.bias    torch.Size([1])     1

namedseq
hidden_linear.weight  torch.Size([13, 1]) 13
hidden_linear.bias    torch.Size([13])    13
output_linear.weight  torch.Size([1, 13]) 13
output_linear.bias    torch.Size([1])     1

subclass
hidden_linear.weight  torch.Size([13, 1]) 13
hidden_linear.bias    torch.Size([13])    13
output_linear.weight  torch.Size([1, 13]) 13
output_linear.bias    torch.Size([1])     1

Using the functional API

torch.nn.functional provides many of the same operations we find in nn, but with any parameters moved into arguments of the function call. By “functional” here we mean “having no internal state”, or, in other words, “whose output value is solely and fully determined by the values of the input arguments”. The functional counterpart of nn.Linear is nn.functional.linear.

A Module, by contrast, is a container for state, in the form of Parameters and submodules, combined with the instructions to do a forward pass.
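As a small hedged illustration of that equivalence (dummy input, randomly initialized layer): passing the layer’s own weight and bias to the functional call reproduces the module’s output.

import torch.nn.functional as F

lin = nn.Linear(1, 1)
x = torch.ones(3, 1)
print(torch.allclose(lin(x), F.linear(x, lin.weight, lin.bias)))   #True: same computation, state passed explicitly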

class SubclassFunctionalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden_linear = nn.Linear(1, 14)
        self.output_linear = nn.Linear(14, 1)
    def forward(self, input):
        hidden_t = self.hidden_linear(input)
        activated_t = torch.tanh(hidden_t)       #use functional form, no state is needed
        output_t = self.output_linear(activated_t)
        return output_t
    
func_model = SubclassFunctionalModel()
func_model
SubclassFunctionalModel(
  (hidden_linear): Linear(in_features=1, out_features=14, bias=True)
  (output_linear): Linear(in_features=14, out_features=1, bias=True)
)
optimizer = optim.SGD(
    func_model.parameters(), 
    lr=1e-3)

training_loop(
    n_epochs = 5000,
    optimizer = optimizer,
    model = func_model,
    loss_fn = nn.MSELoss(),
    t_u_train = train_t_un,
    t_u_val = val_t_un,
    t_c_train = train_t_c,
    t_c_val = val_t_c)

print()
print('output', seq_model(val_t_un))
print('answer', val_t_c)
print('hidden', seq_model.hidden_linear.weight.grad)
Epoch 1, Training loss 179.48944091796875, Validation loss 125.14144897460938
Epoch 1000, Training loss 2.4361188411712646, Validation loss 9.148082733154297
Epoch 2000, Training loss 1.422125220298767, Validation loss 7.820898532867432
Epoch 3000, Training loss 1.3350441455841064, Validation loss 7.297579765319824
Epoch 4000, Training loss 1.3103053569793701, Validation loss 7.1512041091918945
Epoch 5000, Training loss 1.2794909477233887, Validation loss 7.114054203033447

output tensor([[13.1834],
        [16.1821]], grad_fn=<AddmmBackward>)
answer tensor([[11.],
        [13.]])
hidden tensor([[ 0.0005],
        [-0.0007],
        [ 0.0138],
        [ 0.0057],
        [-0.0029],
        [ 0.0027],
        [ 0.0025],
        [-0.0165],
        [ 0.0027],
        [-0.0030],
        [-0.0018],
        [ 0.0017],
        [-0.0152]])

Conclusion

This post described the fundamentals of PyTorch neural networks as applied to a simple linear regression. Much of this functionality is implicit, so it must be well understood before tackling more challenging problems.