Fastbook Chapter 4 Questionnaire Answers
- How is a greyscale image represented on a computer? How about a color image?
- How are the files and folders in the `MNIST_SAMPLE` dataset structured? Why?
- Explain how the "pixel similarity" approach to classifying digits works.
- What is a list comprehension? Create one now that selects odd numbers from a list and doubles them.
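A minimal sketch of one possible answer to the list-comprehension question, selecting the odd numbers and doubling them (the input list here is an arbitrary example):

```python
nums = [1, 2, 3, 4, 5, 6]
# Keep only odd numbers, then double each one kept.
doubled_odds = [n * 2 for n in nums if n % 2 == 1]
print(doubled_odds)  # → [2, 6, 10]
```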
- What is a "rank 3 tensor"?
- What is the difference between tensor rank and shape? How do you get the rank from the shape?
- What are RMSE and L1 norm?
- How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?
- Create a 3x3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom right 4 numbers.
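One way to work the 3x3 exercise, sketched with PyTorch (NumPy would work identically):

```python
import torch

t = torch.arange(1, 10).reshape(3, 3)  # the numbers 1..9 as a 3x3 tensor
t = t * 2                              # double every element
print(t)
print(t[1:, 1:])                       # the bottom-right 2x2 block (4 numbers)
```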
- What is broadcasting?
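A small broadcasting example, assuming PyTorch: the rank-1 tensor is stretched (without copying) to match the matrix's shape, so the addition runs elementwise with no Python loop.

```python
import torch

m = torch.ones(3, 3)             # shape (3, 3)
v = torch.tensor([1., 2., 3.])   # shape (3,)
# v is broadcast across each row of m.
print(m + v)
```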
- Are metrics generally calculated using the training set, or the validation set? Why?
- What is SGD?
- Why does SGD use mini-batches?
- What are the 7 steps in SGD for machine learning?
- How do we initialize the weights in a model?
- What is "loss"?
- Why can't we always use a high learning rate?
- What is a "gradient"?
- Do you need to know how to calculate gradients yourself?
- Why can't we use accuracy as a loss function?
- Draw the sigmoid function. What is special about its shape?
- What is the difference between loss and metric?
- What is the function to calculate new weights using a learning rate?
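A minimal sketch of the weight-update step, assuming PyTorch autograd and a toy loss: the new weights are `w ← w − lr·grad`.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()      # a toy loss
loss.backward()            # gradients: d(loss)/dw = 2*w → [2., 4.]
lr = 0.1
with torch.no_grad():      # update outside the autograd graph
    w -= lr * w.grad       # the SGD step: w ← w − lr·grad
    w.grad.zero_()         # reset gradients for the next step
print(w)                   # values are now [0.8, 1.6]
```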
- What does the `DataLoader` class do?
- Write pseudo-code showing the basic steps taken each epoch for SGD.
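The per-epoch steps can be sketched in Python; the names `dl`, `model`, `loss_func`, `params`, and `lr` are assumptions, not fixed API:

```python
def train_epoch(dl, model, loss_func, params, lr):
    for xb, yb in dl:                  # loop over mini-batches
        preds = model(xb)              # 1. predict
        loss = loss_func(preds, yb)    # 2. measure the loss
        loss.backward()                # 3. compute gradients
        for p in params:
            p.data -= lr * p.grad      # 4. step the weights
            p.grad.zero_()             # 5. reset gradients
```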
- Create a function which, if passed two arguments `[1,2,3,4]` and `'abcd'`, returns `[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]`. What is special about that output data structure?
- What does `view` do in PyTorch?
- What are the "bias" parameters in a neural network? Why do we need them?
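The pairing function asked about above can be sketched with Python's built-in `zip`; the name `pair_up` is just an illustration:

```python
def pair_up(a, b):
    # zip pairs elements positionally; list() materializes the tuples.
    return list(zip(a, b))

print(pair_up([1, 2, 3, 4], 'abcd'))
# → [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```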
- What does the `@` operator do in Python?
- What does the `backward` method do?
- Why do we have to zero the gradients?
- What information do we have to pass to `Learner`?
- Show Python or pseudo-code for the basic steps of a training loop.
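A full training loop is the per-epoch steps repeated over several epochs. A minimal sketch, assuming PyTorch and illustrative names (`model`, `dl`, `params`, etc.):

```python
import torch

def train_model(model, dl, loss_func, params, lr, epochs):
    for _ in range(epochs):              # repeat over the whole dataset
        for xb, yb in dl:
            loss = loss_func(model(xb), yb)
            loss.backward()
            with torch.no_grad():
                for p in params:
                    p -= lr * p.grad     # SGD step
                    p.grad.zero_()       # reset for the next batch
```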
- What is "ReLU"? Draw a plot of it for values from `-2` to `+2`.
- What is an "activation function"?
- What's the difference between `F.relu` and `nn.ReLU`?
- The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?
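A quick check of the `F.relu` vs `nn.ReLU` question: `F.relu` is a plain function, while `nn.ReLU()` is a module you can place inside `nn.Sequential`; numerically they compute the same thing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print(F.relu(x))      # negatives clipped to 0: [0., 0., 0., 1., 2.]
print(nn.ReLU()(x))   # the module form gives the same values
```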