Questionnaire

See Lesson 1

  1. Do you need these for deep learning?

    • Lots of math T / F

      • False - only linear algebra, matrix arithmetic, and some statistics are needed, though a broader math background helps for understanding the underlying principles. SGD itself relies on nothing more than addition and multiplication.
    • Lots of data T / F

      • False - with transfer learning you can greatly reduce the data needed. Some problems can be tackled with very little data, though others inherently require more; creativity in sourcing data is key to working with limited datasets.
    • Lots of expensive computers T / F

      • False - again, transfer learning reduces computing requirements. Besides building an inexpensive DL rig with NVIDIA GPUs (similar to a gaming rig), you can use cloud-based solutions, both free and paid.
    • A PhD T / F

      • False - practitioners come from many backgrounds, and plenty of major DL contributors do not hold PhDs; for example, Joseph Redmon (aka pjreddie) wrote YOLO, a well-regarded object detector, as a CS student without a PhD.
  2. Name five areas where deep learning is now the best in the world.

    • Computer vision - image classification, image segmentation, satellite and drone imagery interpretation, facial recognition, image captioning, self-driving cars
    • Medical imaging and diagnosis - radiology: cancer detection, diabetic retinopathy, anomaly detection in CT scans, X-rays, and MRIs
    • Text processing - language translation, speech-to-text, question answering, document summarization
    • Recommendation systems - Netflix, YouTube, Amazon
    • Games - AlphaGo, AlphaStar
  3. What was the name of the first device that was based on the principle of the artificial neuron?

    • Mark 1 Perceptron
  4. Based on the book of the same name, what are the requirements for "Parallel Distributed Processing"?

    • A set of processing units (nodes/neurons with parameter weights)
    • A state of activation (the activation computed from weights and inputs)
    • An output function for each unit (a linear function plus an activation function, e.g. ReLU)
    • A pattern of connectivity among units (layers of nodes, with the outputs of one layer feeding the next layer as inputs)
    • A propagation rule for propagating patterns of activities through the network of connectivities (forward propagation from layer to layer)
    • An activation rule for combining the inputs impinging on a unit with its current state of activation to produce an output for that unit (combining activations from connected nodes)
    • A learning rule whereby patterns of connectivity are modified by experience (backpropagation - feeding the error back to update the parameter weights)
    • An environment within which the system must operate (the cycle of inputs + weights -> propagate -> outputs -> performance -> update weights)
  5. What were the two theoretical misunderstandings that held back the field of neural networks?

    • that a neural network could not compute even a simple XOR function (true only of a single-layer network; adding a second layer overcomes it, as the sketch below shows)
    • that adding just one extra layer was all that would ever be needed (a two-layer network can in theory approximate any function, but in practice deeper networks perform much better)
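
A minimal sketch, not from the book: a one-hidden-layer network trained on the XOR truth table (the hidden width, learning rate, and step count are arbitrary choices, and an unlucky initialization can occasionally stall):

```python
import torch
import torch.nn as nn

# XOR truth table: not linearly separable, so a single-layer network fails,
# but one hidden layer is enough.
x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.5)

for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

print(model(x).detach().round().flatten())  # ~ tensor([0., 1., 1., 0.])
```
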
  6. What is a GPU?

    • Graphics Processing Unit - a processor originally designed for rendering graphics, repurposed for the massively parallel operations needed in machine learning
  7. Open a notebook and execute a cell containing: 1+1. What happens?

    • the cell runs the expression and prints 2 in the output area below it
  8. Follow through each cell of the stripped version of the notebook for this chapter. Before executing each cell, guess what will happen.

    • done
  9. Complete the Jupyter Notebook online appendix.

    • done
  10. Why is it hard to use a traditional computer program to recognize images in a photo?

    • it is very hard to write down, as an explicit sequence of steps, the rules a computer would need to recognize objects in an image
  11. What did Samuel mean by "Weight Assignment"?

    • choosing values for the weights - i.e., setting the values of the model's weights so that it produces the correct outputs for the given inputs
  12. What term do we normally use in deep learning for what Samuel called "Weights"?

    • parameters
  13. Draw a picture that summarizes Arthur Samuel's view of a machine learning model

    • inputs + weights -> model -> outputs; outputs + labels -> performance; performance -(update)-> weights¹
  14. Why is it hard to understand why a deep learning model makes a particular prediction?

    • because of the huge number of parameters in a model, it's very hard to attribute a prediction to any one particular factor; unlike a hand-written algorithm or a decision tree, the model's reasoning cannot always be visualized or explained.
  15. What is the name of the theorem that a neural network can solve any mathematical problem to any level of accuracy?

    • the Universal Approximation Theorem
  16. What do you need in order to train a model?

    • labelled input data
    • predictions output by the model
    • a loss function that compares predictions (aka predicted labels) against the actual labels
    • a way to update the model based on the loss (backpropagation plus an optimizer such as SGD) - see the minimal loop sketched below
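
A minimal training loop, not from the book, wiring these four pieces together in PyTorch on a toy linear problem (the learning rate, iteration count, and synthetic data are arbitrary choices):

```python
import torch

# Labelled inputs (x) and targets (y) for a toy linear problem: y = 3x + 1 + noise.
x = torch.randn(100, 1)
y = 3 * x + 1 + 0.1 * torch.randn(100, 1)

w = torch.randn(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

for _ in range(200):
    preds = x * w + b                 # model produces predictions
    loss = ((preds - y) ** 2).mean()  # loss compares predictions vs labels
    loss.backward()                   # gradients via backpropagation
    with torch.no_grad():             # SGD step updates the parameters
        w -= 0.1 * w.grad
        b -= 0.1 * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(w.item(), b.item())  # should approach 3 and 1
```
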
  17. How could a feedback loop impact the rollout of a predictive policing model?

    • the model starts optimizing toward a proxy measurement instead of the real goal, e.g. maximizing arrests rather than reducing crime
    • bias in the data leads to biased models, which make biased predictions
    • biased predictions, when applied to the policing environment, generate even more biased data
    • that updated, biased data is then used to further train the already biased model
    • the model appears to become "more accurate," but this has nothing to do with its effectiveness at the actual goal of reducing crime
  18. Do we always have to use 224x224 pixel images with the cat recognition model?

    • no; 224x224 is simply the size these models have traditionally been trained on. Other sizes work - larger images can capture more detail at the cost of more computation.
  19. What is the difference between classification and regression?

    • classification is where the predicted output is one of a limited set of categories; regression is where the predicted output is a continuous value
  20. What is a validation set? What is a test set? Why do we need them?

    • a validation set is labelled data set aside from the training data, used to estimate how the model will perform in production. As the model is tweaked to improve its performance on the validation set, the validation set becomes a progressively worse predictor of eventual production performance. To guard against this, a further set of data, the test set, is also set aside and never used during the tweaking process; its only function is to provide a final estimate (after all tweaking) of the model's production performance.
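
A minimal sketch of the two-stage split using scikit-learn's train_test_split on stand-in data (the 15%/20% fractions are arbitrary choices):

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for a labelled dataset.
data = list(range(1000))

# Carve off the test set first and never touch it while tweaking the model;
# then split the remainder into training and validation sets.
train_val, test = train_test_split(data, test_size=0.15, random_state=42)
train, valid = train_test_split(train_val, test_size=0.2, random_state=42)
print(len(train), len(valid), len(test))  # 680 170 150
```
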
  21. What will fastai do if you don't provide a validation set?

    • fastai will randomly split the provided data and set aside 20 percent as the validation set (valid_pct=0.2 by default)
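
For illustration, the chapter's pet-classifier DataLoaders, where the valid_pct and seed arguments make this split explicit:

```python
from fastai.vision.all import *

path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()  # in this dataset, cat filenames are capitalized

# valid_pct=0.2 holds out a random 20% of the data as the validation set;
# seed fixes the split so it is reproducible across runs.
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
```
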
  22. Can we always use a random sample for a validation set? Why or why not?

    • not always. If the validation and training examples are correlated in a way that production inputs will not be, the validation set is not representative of the actual production data and its metrics will be misleading. Time series are the classic case: a random split lets the model validate on dates interleaved with its training data, leaking future information, so a split by time is better (as sketched below).
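
A minimal sketch of a time-based split with pandas on toy data (the 30-day cutoff is an arbitrary choice):

```python
import pandas as pd

# Toy time series: with a random split, validation rows would be interleaved
# with training rows, letting the model "peek" at the future.
df = pd.DataFrame({'date': pd.date_range('2020-01-01', periods=365),
                   'y': range(365)})

# Split by time instead: train on the past, validate on the most recent month.
cutoff = df['date'].max() - pd.Timedelta(days=30)
train_df = df[df['date'] <= cutoff]
valid_df = df[df['date'] > cutoff]
print(len(train_df), len(valid_df))  # 335 30
```
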
  23. What is overfitting? Provide an example.

    • overfitting is when a model optimizes itself so closely to the training data (driving training loss down) that it loses generality, as indicated by worsening performance on the validation set. The model "memorizes" the data it was trained on rather than learning patterns that transfer to data it has not "seen" (e.g. the validation or test sets). For example, a face recognition model that recognizes only the exact faces it was trained on, but fails on new faces. The sketch below shows the same effect in miniature with polynomial fitting.
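
A minimal sketch, not from the book, using NumPy polynomial fits: the high-degree model matches its training points almost exactly but typically does worse on held-out points (the degrees, noise level, and sample sizes are arbitrary choices; NumPy may warn that the high-degree fit is poorly conditioned, which is part of the point):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)
x_val = np.linspace(0.02, 0.98, 20)
y_val = np.sin(2 * np.pi * x_val) + rng.normal(0, 0.2, 20)

for degree in (3, 15):
    coeffs = np.polyfit(x, y, degree)  # degree 15 ~ overfitting
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree}: train MSE {train_err:.3f}, valid MSE {val_err:.3f}")
```
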
  24. What is a metric? How does it differ to "loss"?

    • a metric is a problem-specific, human-interpretable measurement of the model's performance - the number that matters from the perspective of the model's task goal. A loss is the measurement used to adjust the parameters during training; it must respond smoothly to small parameter changes so that SGD can follow its gradient, which a metric like accuracy does not (see the sketch below).
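
A small illustrative sketch in PyTorch (the logits and targets are toy values): accuracy is flat almost everywhere as the parameters change, while cross-entropy varies smoothly, which is why the latter serves as the loss:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5], [0.2, 1.1], [1.5, 1.4]])
targets = torch.tensor([0, 1, 1])

# Metric: accuracy - human-interpretable, but piecewise constant,
# so its gradient is useless for SGD.
accuracy = (logits.argmax(dim=1) == targets).float().mean()

# Loss: cross-entropy - changes smoothly with the parameters,
# so SGD can follow its gradient.
loss = F.cross_entropy(logits, targets)
print(accuracy.item(), loss.item())
```
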
  25. How can pretrained models help?

    • pretrained models reduce the computing resources and data needed to build useful models by providing a set of weights that are already far better than random. The lower layers capture features common to many tasks (which is what makes pretrained weights reusable), so only the head and upper layers need substantial adjustment for the new task, while the lower layers need little or no fine-tuning. The sketch below fine-tunes a pretrained resnet this way.
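
Essentially the chapter's own cat-recognizer code (vision_learner was named cnn_learner in older fastai versions):

```python
from fastai.vision.all import *

path = untar_data(URLs.PETS)/'images'
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

# The pretrained resnet34 weights already encode generic visual features;
# fine_tune(1) first fits the randomly initialized head for one epoch with
# the body frozen, then unfreezes and trains the whole network for one epoch.
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
```
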
  26. What is the "head" of a model?

    • the head of a model is the last few layers, the part specific to the model's task goal (and the part that gets replaced when fine-tuning a pretrained model)
  27. What kinds of features do the early layers of a CNN find? How about the later layers?

    • the early layers find the simple generic features of all images, while the later layers combine these generic features into more complex specific features for a particular task.
  28. Are image models only useful for photos?

    • no - any data with meaningful spatial or local structure can be converted into images and reuse image models; examples include audio (as spectrograms) and mouse movements (as trajectory plots). A sketch of the audio case follows.
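
A minimal sketch, assuming SciPy and Matplotlib, that turns a synthetic audio signal into a spectrogram image an ordinary image model could consume (the sample rate and chirp parameters are arbitrary choices):

```python
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

# Synthetic "audio": a chirp sweeping from 100 Hz to 2 kHz over 2 seconds.
sr = 8000
t = np.linspace(0, 2, 2 * sr, endpoint=False)
audio = signal.chirp(t, f0=100, f1=2000, t1=2)

# Short-time Fourier transform -> 2-D time/frequency array, which can be
# saved as an image and fed to an ordinary image model.
f, tt, Sxx = signal.spectrogram(audio, fs=sr)
plt.pcolormesh(tt, f, 10 * np.log10(Sxx + 1e-12))
plt.xlabel('time (s)')
plt.ylabel('frequency (Hz)')
plt.savefig('spectrogram.png')
```
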
  29. What is an "architecture"?

    • an architecture is the pattern of interconnections between layers of neurons in a neural network
  30. What is segmentation?

    • segmentation is the image task of assigning a category to each pixel in an image (see the sketch below)
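
Essentially the chapter's CAMVID example: a unet_learner predicts a category code for every pixel:

```python
from fastai.vision.all import *

path = untar_data(URLs.CAMVID_TINY)
dls = SegmentationDataLoaders.from_label_func(
    path, bs=8, fnames=get_image_files(path/"images"),
    # each image maps to a label mask of per-pixel category codes
    label_func=lambda o: path/'labels'/f'{o.stem}_P{o.suffix}',
    codes=np.loadtxt(path/'codes.txt', dtype=str))
learn = unet_learner(dls, resnet34)
learn.fine_tune(8)
```
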
  31. What is y_range used for? When do we need it?

    • y_range specifies the range of possible continuous values for the output. It is needed for regression tasks where the target lies in a known range (e.g. movie ratings), as in the sketch below.
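
As in the chapter's collaborative-filtering example, where predicted movie ratings are constrained to the 0.5-5.5 range:

```python
from fastai.collab import *

# Movie-rating regression: ratings are continuous, so constrain the model's
# output to the valid range with y_range (applied via a scaled sigmoid).
path = untar_data(URLs.ML_SAMPLE)
dls = CollabDataLoaders.from_csv(path/'ratings.csv')
learn = collab_learner(dls, y_range=(0.5, 5.5))
learn.fine_tune(10)
```
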
  32. What are "hyperparameters"?

    • hyperparameters are the higher-level choices about how the model is trained - e.g. learning rate, number of epochs, network architecture, data augmentation - set by the practitioner rather than learned from the data ("parameters about parameters")
  33. What's the best way to avoid failures when using AI in an organization?

    • create a validation or test set to evaluate the performance of any proposed AI solution, so you have a good prediction of the deployed model's real performance. By setting aside data for an independent evaluation, separate from the data provided for building the model, the organization can develop a reliable sense of the model's future performance in the real world, independent of any evaluation by the model's provider.

Further research

Each chapter also has a "further research" with questions that aren't fully answered in the text, or include more advanced assignments. Answers to these questions aren't on the book website--you'll need to do your own research!

  1. Why is a GPU useful for deep learning? How is a CPU different, and why is it less effective for deep learning?
    • A GPU can massively parallelize the mathematical operations - chiefly matrix multiplications - needed for gradient descent, letting deep networks train in reasonable time. A CPU has far fewer cores and processes these operations largely sequentially, so it becomes the bottleneck. The rough timing sketch below illustrates the gap.
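
A rough timing sketch in PyTorch (the matrix size is an arbitrary choice; the second half requires a CUDA-capable GPU):

```python
import time
import torch

x = torch.randn(4096, 4096)

# CPU: a handful of cores chew through the matrix multiply largely serially.
t0 = time.perf_counter()
_ = x @ x
print(f"CPU: {time.perf_counter() - t0:.3f}s")

# GPU: thousands of cores compute the output tiles in parallel.
if torch.cuda.is_available():
    xg = x.cuda()
    _ = xg @ xg                   # warm-up (CUDA init, kernel selection)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    _ = xg @ xg
    torch.cuda.synchronize()      # wait for the asynchronous kernel to finish
    print(f"GPU: {time.perf_counter() - t0:.3f}s")
```
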
  2. Try to think of three areas where feedback loops might impact use of machine learning. See if you can find documented examples of that happening in practice.
    • YouTube recommendations that optimize user engagement can become addictive and create filter bubbles: the recommendations reinforce the user's existing biases, because biased content is also what keeps the user engaged, which can radicalize the user's worldview. Facebook feed recommendations behave the same way. A third example is the recidivism-prediction model COMPAS, which has been criticized as being no more accurate than untrained people at predicting reoffending.

Footnotes

1. Samuel's view of a machine learning model

[Diagram: inputs and weights feed into the model; the model produces outputs; outputs together with the labels feed a performance measure; performance updates the weights.]