Ask Sawal

Discussion Forum

How to fix "module 'torch' has no attribute 'square'" (Python Programming Language)

5 Answer(s) Available
Answer # 1 #
>>> import torch
>>> a = torch.randn(4)
>>> a
tensor([-2.0755,  1.0226,  0.0831,  0.4806])
>>> torch.square(a)
tensor([ 4.3077,  1.0457,  0.0069,  0.2310])
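If calling torch.square raises "AttributeError: module 'torch' has no attribute 'square'", your installed PyTorch most likely predates the function (it was only added around PyTorch 1.5). Upgrade PyTorch, or use one of these equivalents, which work on any version:

>>> a ** 2
tensor([ 4.3077,  1.0457,  0.0069,  0.2310])
>>> torch.pow(a, 2)
tensor([ 4.3077,  1.0457,  0.0069,  0.2310])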
[9]
Joyia Sonal
QUILLER OPERATOR
Answer # 2 #

PyGAD has a module called pygad.kerasga. It trains Keras models using the genetic algorithm. On January 3rd, 2021, a new release of PyGAD 2.10.0 brought a new module called pygad.torchga to train PyTorch models. It’s very easy to use, but there are a few tricky steps.

So, in this tutorial, we’ll explore how to use PyGAD to train PyTorch models.

Let’s get started.

PyGAD is a Python 3 library, available at PyPI (Python Package Index). So, you can install it simply using this pip command:
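The package is named pygad on PyPI, so the command is simply:

pip3 install pygad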

Make sure you're getting at least version 2.10.0; earlier releases don't support the pygad.torchga module.

You can also download the wheel distribution file of PyGAD 2.10.0 from this link, and install it with the following command (make sure the current directory is set to the directory with the .whl file).
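Assuming the downloaded wheel keeps its default name (the exact filename may differ per release):

pip3 install pygad-2.10.0-py3-none-any.whl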

After PyGAD is installed, it’s time to start with the pygad.torchga module.

To learn more about PyGAD, please read its documentation at Read the Docs. You can also access the documentation of the pygad.torchga module directly through this link.

PyGAD 2.10.0 lets us train PyTorch models using the genetic algorithm (GA). The problem of training a PyTorch model is formulated for the GA as an optimization problem, where all the parameters in the model (e.g. weights and biases) are represented as a single vector (i.e. chromosome).

The pygad.torchga module (torchga is short for Torch Genetic Algorithm) helps us formulate the PyTorch model training problem the way PyGAD expects it. The module has 1 class and 2 functions:

- TorchGA: a class for creating an initial population of solutions from a PyTorch model's parameters.
- model_weights_as_vector(): a function that returns the model's parameters as a 1D vector.
- model_weights_as_dict(): a function that restores such a vector into a parameters dictionary.

The source code of the pygad.torchga module is available at the ahmedfgad/TorchGA GitHub project.

The constructor of the TorchGA class accepts the following 2 arguments:

- model: the PyTorch model.
- num_solutions: the number of solutions in the population.

Each of these arguments is used as an attribute in the instances of the pygad.torchga.TorchGA class. This means you can access the model by using the model attribute as follows:
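A minimal sketch (model can be any existing PyTorch model; num_solutions=10 matches the example discussed below):

import pygad.torchga

torch_ga = pygad.torchga.TorchGA(model=model, num_solutions=10)

print(torch_ga.model)          # the PyTorch model passed to the constructor
print(torch_ga.num_solutions)  # 10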

There is a third attribute called population_weights, which is a 2D list of all solutions in the population. Remember that each solution is a 1D list holding the model’s parameters.

The sketch above shows how an instance of the TorchGA class is created. The model argument can be assigned to any PyTorch model. The value passed to the num_solutions argument is 10, which means there are 10 solutions in the population.

The constructor of the TorchGA class calls a method called create_population() which creates and returns a population of solutions to the PyTorch model. At first, the model_weights_as_vector() function is called to return model parameters as a vector.

This vector is used to create the solutions in the population. To differentiate the solutions from one another, random values are added to the vector.

Assuming that the model has 30 parameters, then the shape of the population_weights array is 10×30.

Now, let’s go over the steps needed to train a PyTorch model using PyGAD.

To train a PyTorch model using PyGAD, we need to go through these steps:

We’ll discuss each step in detail.

It’s important to decide whether the type of problem being solved by the PyTorch model is classification or regression. This will help us prepare:

- the loss function,
- the activation function of the output layer, and
- the training data.

For the loss functions offered by PyTorch, check this link. Examples of loss functions for regression problems include mean absolute error (nn.L1Loss) and mean square error (nn.MSELoss).

For a classification problem, some examples are binary cross-entropy (nn.BCELoss) for binary classification and cross-entropy (nn.CrossEntropyLoss) for multi-class problems.

Based on whether the problem is classification or regression, we can decide the activation function in the output layer. For example, softmax is for classification, linear is for regression.

The training data also depends on the problem type. If the problem is classification, then the output comes from a set of finite discrete values. If the problem is regression, then the output comes from a set of infinite continuous values.

We’ll do an example of building a PyTorch model, using the torch.nn module, to solve a simple regression problem. The model has 3 layers: a Linear layer accepting the 3 inputs, a ReLU activation, and a Linear output layer returning a single value.

If the problem is classification, we must add an appropriate output layer, like SoftMax.

Finally, the model is created as an instance of the torch.nn.Sequential class, which accepts all the layers previously created in order.
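A sketch of such a model (the 3 inputs and single output match the regression data used later; the hidden size of 5 is an arbitrary choice):

import torch

input_layer = torch.nn.Linear(3, 5)
relu_layer = torch.nn.ReLU()
output_layer = torch.nn.Linear(5, 1)

model = torch.nn.Sequential(input_layer,
                            relu_layer,
                            output_layer)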

We won’t go in-depth about how to build PyTorch models. For more details, you can check the PyTorch documentation.

Now, we’ll create an initial population of PyTorch model’s parameters using the pygad.torchga.TorchGA class.

Using the TorchGA class, PyGAD offers a simple interface to create an initial population of solutions to the PyTorch model. Just create an instance of pygad.torchga.TorchGA class, and an initial population will be created automatically.

Here is an example that passes the previously created model to the constructor of the TorchGA class.
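(Reconstructed as a sketch, reusing the model built above:)

torch_ga = pygad.torchga.TorchGA(model=model, num_solutions=10)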

Now let’s create random training data to train the model.

Based on whether the problem is classification or regression, we prepare the training data accordingly.

Here are 5 random samples, where each sample has 3 inputs and 1 output.
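(The original values are missing from this copy; here is illustrative random data of the same shape:)

torch.manual_seed(0)
data_inputs = torch.rand(5, 3)   # 5 samples, 3 inputs each
data_outputs = torch.rand(5, 1)  # 5 samples, 1 output each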

If we’re solving a binary classification problem like XOR, then its data is given below, where there are 4 samples with 2 inputs and 1 output.
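The XOR truth table as tensors:

data_inputs = torch.tensor([[0.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 0.0],
                            [1.0, 1.0]])
data_outputs = torch.tensor([[0.0],
                             [1.0],
                             [1.0],
                             [0.0]])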

Time for the loss function for regression and classification problems.

For a regression problem, loss functions include:

- mean absolute error (torch.nn.L1Loss)
- mean square error (torch.nn.MSELoss)

For a classification problem, the loss functions include:

- binary cross-entropy (torch.nn.BCELoss) for binary classification
- cross-entropy (torch.nn.CrossEntropyLoss) for multi-class problems

Check this page for more information about loss functions in PyTorch.

Here’s an example of calculating binary cross-entropy using the torch.nn.BCELoss class. The detach() method is called to detach the tensor from the graph, in order to return its value. Check this link for more information about the detach() method.
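A sketch (predictions is assumed to hold sigmoid outputs in [0, 1], as torch.nn.BCELoss requires):

loss_function = torch.nn.BCELoss()

predictions = model(data_inputs)
loss = loss_function(predictions, data_outputs)
loss_value = loss.detach().numpy()  # detach from the graph to read the value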

The fitness function is then computed based on the calculated loss.

The genetic algorithm expects the fitness function to be a maximization one, where the higher its output, the better the result. However, calculating the loss for machine learning models is based on a minimization loss function. The lower the loss, the better the result.

If the fitness were set equal to the loss, the genetic algorithm would search in the direction that makes the fitness increase, and would therefore increase the loss. This is why the fitness is calculated as the inverse of the loss, as in the next line.
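That line, as a sketch (loss being the tensor returned by the loss function):

solution_fitness = 1.0 / (loss.detach().numpy() + 0.00000001)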

The small value 0.00000001 is added to avoid dividing by zero when loss=0.0.

When training PyTorch models using PyGAD, there are multiple solutions and each solution is a vector that holds all the parameters of the model.

To build the fitness function, follow these steps (each is discussed below):

- Convert the solution vector back into a dictionary of model parameters.
- Load those parameters into the model.
- Make predictions on the training data.
- Calculate the loss, and return its inverse as the fitness.

Next, we’ll build the fitness function for regression and binary classification problems.

The fitness function in PyGAD is built as a regular Python function, but it must accept 2 arguments representing:

- the solution (a 1D vector of model parameters), and
- the solution's index within the population.

The solution passed to the fitness function is a 1D vector. This vector can’t be used directly for the parameters of the PyTorch model, as the model expects parameters in the form of a dictionary. So, before calculating the loss, we need to convert the vector into a dictionary. We can use the model_weights_as_dict() function in the pygad.torchga module, as follows:
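A sketch of that call:

model_weights_dict = pygad.torchga.model_weights_as_dict(model=model,
                                                         weights_vector=solution)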

Once the dictionary of parameters is created, then the load_state_dict() method is called to use the parameters in this dictionary as the current parameters of the model.

According to the current parameters, the model makes predictions on the training data.

The model’s predictions are passed to the loss function to calculate the solution’s loss. The mean absolute error is used as the loss function.

Finally, the fitness value is returned.
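Putting the pieces together, here is a sketch of the regression fitness function (the two-argument signature is what PyGAD 2.10.0 expects; model, data_inputs and data_outputs come from the sketches above):

loss_function = torch.nn.L1Loss()  # mean absolute error

def fitness_func(solution, solution_idx):
    model_weights_dict = pygad.torchga.model_weights_as_dict(model=model,
                                                             weights_vector=solution)
    model.load_state_dict(model_weights_dict)
    predictions = model(data_inputs)
    abs_error = loss_function(predictions, data_outputs).detach().numpy() + 0.00000001
    return 1.0 / abs_error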

Here is the fitness function for a binary classification problem. The loss function used is binary cross-entropy.
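A sketch of the binary classification variant (the model is assumed to end in a sigmoid, so its outputs lie in [0, 1]):

loss_function = torch.nn.BCELoss()

def fitness_func(solution, solution_idx):
    model_weights_dict = pygad.torchga.model_weights_as_dict(model=model,
                                                             weights_vector=solution)
    model.load_state_dict(model_weights_dict)
    predictions = model(data_inputs)
    bce = loss_function(predictions, data_outputs).detach().numpy() + 0.00000001
    return 1.0 / bce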

The created fitness function should be assigned to the fitness_func argument in the pygad.GA class’s constructor.

Next, we’ll build a callback function executed at the end of each generation.

According to the PyGAD lifecycle, there’s a callback function that’s called after each generation. This function could be implemented and used to print some debugging information, like the best fitness value in each generation, and the number of completed generations. Note that this step is optional and for debugging purposes only.

All you need to do is to implement the callback function, and then assign it to the on_generation argument in the constructor of the pygad.GA class. Here is the callback function which accepts a single argument representing the instance of the pygad.GA class.
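A sketch of such a callback:

def callback_generation(ga_instance):
    print("Generation =", ga_instance.generations_completed)
    print("Fitness    =", ga_instance.best_solution()[1])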

Using this instance, the attribute generations_completed is returned, which holds the number of completed generations. The best_solution() method is also called; it returns information about the best solution in the current generation.

The next step is creating an instance of the pygad.GA class, responsible for running the genetic algorithm to train the PyTorch model.

The constructor of the pygad.GA class accepts many arguments that can be explored in the documentation. Using just some of those arguments, the next code creates an instance of the pygad.GA class and saves it in the ga_instance variable:
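A sketch with typical values (num_generations and num_parents_mating are arbitrary choices; torch_ga, fitness_func and callback_generation come from the sketches above):

import pygad

ga_instance = pygad.GA(num_generations=250,
                       num_parents_mating=5,
                       initial_population=torch_ga.population_weights,
                       fitness_func=fitness_func,
                       on_generation=callback_generation)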

Note that the number of solutions within the population was previously set to 10 in the constructor of the TorchGA class. Thus, the number of parents to mate must be less than 10.

In the next section, we call the run() method to run the genetic algorithm and train the PyTorch model.

The ga_instance of pygad.GA can now call the run() method to start the genetic algorithm.
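Starting the search is a single call:

ga_instance.run()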

After this method completes, we can make predictions using the best solution found by the genetic algorithm in the last generation.

There’s a useful method called plot_result() in the pygad.GA class; it shows a figure relating the fitness value to the generation number, and is useful after the run() method completes.
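Called like this:

ga_instance.plot_result()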

The pygad.GA class has a method called best_solution() which returns 3 outputs:

- the best solution itself,
- the fitness value of the best solution, and
- the index of the best solution within the population.

The next code calls the best_solution() method and prints information about the best solution.
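A sketch of that code:

solution, solution_fitness, solution_idx = ga_instance.best_solution()
print("Fitness value of the best solution =", solution_fitness)
print("Index of the best solution         =", solution_idx)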

The best solution’s parameters can be converted into a dictionary that’s fed into the PyTorch model for making predictions.
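For example:

best_solution_weights = pygad.torchga.model_weights_as_dict(model=model,
                                                            weights_vector=solution)
model.load_state_dict(best_solution_weights)
predictions = model(data_inputs)
print("Predictions:\n", predictions.detach().numpy())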

The next code calculates the loss after the model is trained.
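A sketch, reusing the L1 loss from the fitness function:

abs_error = loss_function(predictions, data_outputs)
print("Absolute error:", abs_error.detach().numpy())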

After covering all the steps to build and train PyTorch models using PyGAD, next we’ll check out 2 examples with complete code.

For a regression problem that uses the mean absolute error as a loss function, here is the complete code.
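The original listing is missing from this copy, so below is a self-contained sketch assembled from the pieces above (the layer sizes, generation counts and random data are illustrative choices, not necessarily the tutorial's exact values):

import torch
import pygad
import pygad.torchga

def fitness_func(solution, solution_idx):
    model_weights_dict = pygad.torchga.model_weights_as_dict(model=model,
                                                             weights_vector=solution)
    model.load_state_dict(model_weights_dict)
    predictions = model(data_inputs)
    abs_error = loss_function(predictions, data_outputs).detach().numpy() + 0.00000001
    return 1.0 / abs_error

def callback_generation(ga_instance):
    print("Generation =", ga_instance.generations_completed)
    print("Fitness    =", ga_instance.best_solution()[1])

# the regression model: 3 inputs, one hidden ReLU layer, 1 output
model = torch.nn.Sequential(torch.nn.Linear(3, 5),
                            torch.nn.ReLU(),
                            torch.nn.Linear(5, 1))

# initial population built from the model's parameters
torch_ga = pygad.torchga.TorchGA(model=model, num_solutions=10)

loss_function = torch.nn.L1Loss()  # mean absolute error

# illustrative random training data: 5 samples, 3 inputs, 1 output
torch.manual_seed(0)
data_inputs = torch.rand(5, 3)
data_outputs = torch.rand(5, 1)

ga_instance = pygad.GA(num_generations=250,
                       num_parents_mating=5,
                       initial_population=torch_ga.population_weights,
                       fitness_func=fitness_func,
                       on_generation=callback_generation)
ga_instance.run()
ga_instance.plot_result()

# inspect the best solution found in the last generation
solution, solution_fitness, solution_idx = ga_instance.best_solution()
print("Fitness value of the best solution =", solution_fitness)

best_solution_weights = pygad.torchga.model_weights_as_dict(model=model,
                                                            weights_vector=solution)
model.load_state_dict(best_solution_weights)
predictions = model(data_inputs)
print("Predictions:\n", predictions.detach().numpy())

abs_error = loss_function(predictions, data_outputs)
print("Absolute error:", abs_error.detach().numpy())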

The next figure is the result of calling the plot_result() method. It shows fitness value change by generation.

Here are the outputs of the print statements in the code. The MAE is 0.0069.

The next code builds a convolutional neural network (CNN) using PyTorch for classifying a dataset of 80 images, where the size of each image is 100x100x3. Cross-entropy loss is used in this example because there are more than 2 classes.
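The CNN listing is also missing here; below is a sketch with shape-consistent layers for 100x100x3 inputs (the filter counts and the 4 output classes are assumptions; 768 = 16*16*3 is the flattened size these particular layers produce):

import torch

model = torch.nn.Sequential(torch.nn.Conv2d(in_channels=3, out_channels=5, kernel_size=7),
                            torch.nn.ReLU(),
                            torch.nn.MaxPool2d(kernel_size=5, stride=5),
                            torch.nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
                            torch.nn.ReLU(),
                            torch.nn.Flatten(),
                            torch.nn.Linear(in_features=768, out_features=15),
                            torch.nn.ReLU(),
                            torch.nn.Linear(in_features=15, out_features=4))

loss_function = torch.nn.CrossEntropyLoss()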

Training data can be downloaded from these links:

The next figure is the result of calling the plot_result() method. It shows fitness value change by generation.

Here’s some information about the trained model.

We explored how to train PyTorch models with the genetic algorithm using a Python 3 library called PyGAD.

PyGAD has a module torchga, which helps to formulate the problem of training PyTorch models as an optimization problem for the genetic algorithm. The torchga module creates an initial population of PyTorch model’s parameters, where each solution holds a different set of parameters for the model. Using PyGAD, the solutions in the population are evolved.

It’s a great way to play around with genetic algorithms. Try it, experiment a bit, and see what comes up!

[4]
Hrvoje Kothari
SLURRY CONTROL TENDER
Answer # 3 #

PyTorch is the fastest growing Deep Learning framework, and it is also used by Fast.ai in its MOOC, Deep Learning for Coders, and in its library.

PyTorch is also very pythonic, meaning, it feels more natural to use it if you already are a Python developer.

Besides, using PyTorch may even improve your health, according to Andrej Karpathy :-)

There are many many PyTorch tutorials around and its documentation is quite complete and extensive. So, why should you keep reading this step-by-step tutorial?

Well, even though one can find information on pretty much anything PyTorch can do, I missed having a structured, incremental, from-first-principles approach to it.

In this post, I will guide you through the main reasons why PyTorch makes it much easier and more intuitive to build a Deep Learning model in Python — autograd, dynamic computation graph, model classes and more — and I will also show you how to avoid some common pitfalls and errors along the way.

Moreover, since this is quite a long post, I built a Table of Contents to make navigation easier, should you use it as a mini-course and work your way through the content one topic at a time.

Most tutorials start with some nice and pretty image classification problem to illustrate how to use PyTorch. It may seem cool, but I believe it distracts you from the main goal: how does PyTorch work?

For this reason, in this tutorial, I will stick with a simple and familiar problem: a linear regression with a single feature x! It doesn’t get much simpler than that…

Let’s start generating some synthetic data: we start with a vector of 100 points for our feature x and create our labels using a = 1, b = 2 and some Gaussian noise.
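A sketch of the data generation:

import numpy as np

np.random.seed(42)
x = np.random.rand(100, 1)
y = 1 + 2 * x + .1 * np.random.randn(100, 1)  # a = 1, b = 2, plus Gaussian noise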

Next, let’s split our synthetic data into train and validation sets, shuffling the array of indices and using the first 80 shuffled points for training.
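Along these lines:

# shuffle the indices
idx = np.arange(100)
np.random.shuffle(idx)

# use the first 80 indices for training, the remaining 20 for validation
train_idx, val_idx = idx[:80], idx[80:]

x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]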

We know that a = 1 and b = 2, but now let’s see how close we can get to the true values by using gradient descent and the 80 points in the training set…

If you are comfortable with the inner workings of gradient descent, feel free to skip this section. It goes beyond the scope of this post to fully explain how gradient descent works, but I’ll cover the four basic steps you’d need to go through to compute it.

For a regression problem, the loss is given by the Mean Square Error (MSE), that is, the average of all squared differences between labels (y) and predictions (a + bx).

A gradient is a partial derivative — why partial? Because one computes it with respect to (w.r.t.) a single parameter. We have two parameters, a and b, so we must compute two partial derivatives.

A derivative tells you how much a given quantity changes when you slightly vary some other quantity. In our case, how much does our MSE loss change when we vary each one of our two parameters?

The right-most part of the equations below is what you usually see in implementations of gradient descent for a simple linear regression. In the intermediate step, I show you all elements that pop-up from the application of the chain rule, so you know how the final expression came to be.

In the final step, we use the gradients to update the parameters. Since we are trying to minimize our losses, we reverse the sign of the gradient for the update.

There is still another parameter to consider: the learning rate, denoted by the Greek letter eta (that looks like the letter n), which is the multiplicative factor that we need to apply to the gradient for the parameter update.
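In symbols (a compact restatement of the steps above, with \eta the learning rate):

\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - a - b x_i\right)^2

\frac{\partial\,\text{MSE}}{\partial a} = -\frac{2}{N}\sum_{i=1}^{N}\left(y_i - a - b x_i\right)
\qquad
\frac{\partial\,\text{MSE}}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N} x_i\left(y_i - a - b x_i\right)

a \leftarrow a - \eta\,\frac{\partial\,\text{MSE}}{\partial a}
\qquad
b \leftarrow b - \eta\,\frac{\partial\,\text{MSE}}{\partial b}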

How to choose a learning rate? That is a topic on its own and beyond the scope of this post as well.

Now we use the updated parameters to go back to Step 1 and restart the process.

Repeating this process over and over, for many epochs, is, in a nutshell, training a model.

It’s time to implement our linear regression model using gradient descent using Numpy only.

Yes, using Numpy for this is a bit of a detour, but it serves two purposes: first, to introduce the structure of our task, which will remain largely the same and, second, to show you the main pain points so you can fully appreciate how much PyTorch makes your life easier :-)

For training a model, there are two initialization steps:

- random initialization of the parameters a and b, and
- initialization of the hyper-parameters (in our case, the learning rate and the number of epochs).

Make sure to always initialize your random seed to ensure reproducibility of your results. As usual, the random seed is 42, the least random of all random seeds one could possibly choose :-)

For each epoch, there are four training steps:

- compute the model's predictions (the forward pass),
- compute the loss,
- compute the gradients for both parameters, and
- update the parameters.

Just keep in mind that, if you don’t use batch gradient descent (our example does), you’ll have to write an inner loop to perform the four training steps for either each individual point (stochastic) or n points (mini-batch). We’ll see a mini-batch example later down the line.
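A sketch of the whole Numpy implementation (the learning rate and epoch count are typical choices):

np.random.seed(42)
a = np.random.randn(1)   # random initialization of the parameters
b = np.random.randn(1)

lr = 1e-1                # learning rate
n_epochs = 1000

for epoch in range(n_epochs):
    yhat = a + b * x_train                  # Step 1: compute predictions
    error = y_train - yhat
    loss = (error ** 2).mean()              # Step 2: compute the MSE loss
    a_grad = -2 * error.mean()              # Step 3: compute the gradients
    b_grad = -2 * (x_train * error).mean()
    a = a - lr * a_grad                     # Step 4: update the parameters
    b = b - lr * b_grad

print(a, b)  # should be close to a = 1, b = 2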

Just to make sure we haven’t done any mistakes in our code, we can use Scikit-Learn’s Linear Regression to fit the model and compare the coefficients.
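The sanity check:

from sklearn.linear_model import LinearRegression

linr = LinearRegression()
linr.fit(x_train, y_train)
print(linr.intercept_, linr.coef_[0])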

They match up to 6 decimal places — we have a fully working implementation of linear regression using Numpy.

Time to TORCH it :-)

First, we need to cover a few basic concepts that may throw you off-balance if you don’t grasp them well enough before going full-force on modeling.

In Deep Learning, we see tensors everywhere. Well, Google’s framework is called TensorFlow for a reason! What is a tensor, anyway?

In Numpy, you may have an array that has three dimensions, right? That is, technically speaking, a tensor.

A scalar (a single number) has zero dimensions, a vector has one dimension, a matrix has two dimensions and a tensor has three or more dimensions. That’s it!

But, to keep things simple, it is commonplace to call vectors and matrices tensors as well — so, from now on, everything is either a scalar or a tensor.

“How do we go from Numpy’s arrays to PyTorch’s tensors”, you ask? That’s what from_numpy is good for. It returns a CPU tensor, though.

“But I want to use my fancy GPU…”, you say. No worries, that’s what to() is good for. It sends your tensor to whatever device you specify, including your GPU (referred to as cuda or cuda:0).

“What if I want my code to fallback to CPU if no GPU is available?”, you may be wondering… PyTorch got your back once more — you can use cuda.is_available() to find out if you have a GPU at your disposal and set your device accordingly.

You can also easily cast it to a lower precision (32-bit float) using float().
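A sketch covering from_numpy(), the device fallback and the cast:

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

x_train_tensor = torch.from_numpy(x_train).float().to(device)
y_train_tensor = torch.from_numpy(y_train).float().to(device)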

If you compare the types of both variables, you’ll get what you’d expect: numpy.ndarray for the first one and torch.Tensor for the second one.

But where does your nice tensor “live”? In your CPU or your GPU? You can’t say… but if you use PyTorch’s type(), it will reveal its location — torch.cuda.FloatTensor — a GPU tensor in this case.

We can also go the other way around, turning tensors back into Numpy arrays, using numpy(). It should be as easy as x_train_tensor.numpy() but…

Unfortunately, Numpy cannot handle GPU tensors… you need to make them CPU tensors first using cpu().
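That is:

x_train_array = x_train_tensor.cpu().numpy()  # GPU tensor -> CPU tensor -> Numpy array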

What distinguishes a tensor used for data — like the ones we’ve just created — from a tensor used as a (trainable) parameter/weight?

The latter tensors require the computation of their gradients, so we can update their values (the parameters’ values, that is). That’s what the requires_grad=True argument is good for. It tells PyTorch we want it to compute gradients for us.

You may be tempted to create a simple tensor for a parameter and, later on, send it to your chosen device, as we did with our data, right? Not so fast…

The first chunk of code creates two nice tensors for our parameters, gradients and all. But they are CPU tensors.

In the second chunk of code, we tried the naive approach of sending them to our GPU. We succeeded in sending them to another device, but we ”lost” the gradients somehow…

In the third chunk, we first send our tensors to the device and then use requires_grad_() method to set its requires_grad to True in place.

Although the last approach worked fine, it is much better to assign tensors to a device at the moment of their creation.
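A sketch of the attempts just described, ending with the recommended approach:

# FIRST: gradient-tracking tensors, but on the CPU
a = torch.randn(1, requires_grad=True, dtype=torch.float)
b = torch.randn(1, requires_grad=True, dtype=torch.float)

# SECOND: .to(device) returns a new tensor, "losing" the gradient setup
a = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)
b = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)

# THIRD: send to the device first, then set requires_grad in place
a = torch.randn(1, dtype=torch.float).to(device)
b = torch.randn(1, dtype=torch.float).to(device)
a.requires_grad_()
b.requires_grad_()

# RECOMMENDED: specify the device at creation time
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)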

Much easier, right?

Now that we know how to create tensors that require gradients, let’s see how PyTorch handles them — that’s the role of the…

Autograd is PyTorch’s automatic differentiation package. Thanks to it, we don’t need to worry about partial derivatives, chain rule or anything like it.

So, how do we tell PyTorch to do its thing and compute all gradients? That’s what backward() is good for.

Do you remember the starting point for computing the gradients? It was the loss, as we computed its partial derivatives w.r.t. our parameters. Hence, we need to invoke the backward() method from the corresponding Python variable, like, loss.backward().

What about the actual values of the gradients? We can inspect them by looking at the grad attribute of a tensor.

If you check the method’s documentation, it clearly states that gradients are accumulated. So, every time we use the gradients to update the parameters, we need to zero the gradients afterwards. And that’s what zero_() is good for.

What does the underscore (_) at the end of the method name mean? Do you remember? If not, scroll back to the previous section and find out.

So, let’s ditch the manual computation of gradients and use both backward() and zero_() methods instead.

That’s it? Well, pretty much… but, there is always a catch, and this time it has to do with the update of the parameters…

In the first attempt, if we use the same update structure as in our Numpy code, we’ll get the weird error below… but we can get a hint of what’s going on by looking at the tensor itself — once again we “lost” the gradient while reassigning the update results to our parameters. Thus, the grad attribute turns out to be None and it raises the error…

We then change it slightly, using a familiar in-place Python assignment in our second attempt. And, once again, PyTorch complains about it and raises an error.

So, how do we tell PyTorch to “back off” and let us update our parameters without messing up with its fancy dynamic computation graph? That’s what torch.no_grad() is good for. It allows us to perform regular Python operations on tensors, independent of PyTorch’s computation graph.
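A sketch of the loop at this stage, using backward(), no_grad() and zero_():

lr = 1e-1
n_epochs = 1000

for epoch in range(n_epochs):
    yhat = a + b * x_train_tensor
    error = y_train_tensor - yhat
    loss = (error ** 2).mean()

    loss.backward()           # computes the gradients for a and b

    with torch.no_grad():     # the updates must not be tracked by the graph
        a -= lr * a.grad
        b -= lr * b.grad

    a.grad.zero_()            # gradients accumulate, so reset them
    b.grad.zero_()

print(a, b)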

Finally, we managed to successfully run our model and get the resulting parameters. Surely enough, they match the ones we got in our Numpy-only implementation.

How great was “The Matrix”? Right, right? But, jokes aside, I want you to see the graph for yourself too!

So, let’s stick with the bare minimum: two (gradient computing) tensors for our parameters, predictions, errors and loss.
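A sketch of that setup (make_dot comes from the third-party torchviz package):

from torchviz import make_dot

torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

yhat = a + b * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()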

If we call make_dot(yhat) we’ll get the left-most graph on Figure 3 below:

Let’s take a closer look at its components:

If we plot graphs for the error (center) and loss (right) variables, the only difference between them and the first one is the number of intermediate steps (gray boxes).

Now, take a closer look at the green box of the left-most graph: there are two arrows pointing to it, since it is adding up two variables, a and b*x. Seems obvious, right?

Then, look at the gray box of the same graph: it is performing a multiplication, namely, b*x. But there is only one arrow pointing to it! The arrow comes from the blue box that corresponds to our parameter b.

Why don’t we have a box for our data x? The answer is: we do not compute gradients for it! So, even though there are more tensors involved in the operations performed by the computation graph, it only shows gradient-computing tensors and their dependencies.

What would happen to the computation graph if we set requires_grad to False for our parameter a?

Unsurprisingly, the blue box corresponding to the parameter a is no more! Simple enough: no gradients, no graph.

The best thing about the dynamic computing graph is the fact that you can make it as complex as you want it. You can even use control flow statements (e.g., if statements) to control the flow of the gradients (obviously!) :-)

Figure 5 below shows an example of this. And yes, I do know that the computation itself is completely nonsense…

So far, we’ve been manually updating the parameters using the computed gradients. That’s probably fine for two parameters… but what if we had a whole lot of them?! We use one of PyTorch’s optimizers, like SGD or Adam.

An optimizer takes the parameters we want to update, the learning rate we want to use (and possibly many other hyper-parameters as well!) and performs the updates through its step() method.

Besides, we also don’t need to zero the gradients one by one anymore. We just invoke the optimizer’s zero_grad() method and that’s it!

In the code below, we create a Stochastic Gradient Descent (SGD) optimizer to update our parameters a and b.

Let’s check our two parameters, before and after, just to make sure everything is still working fine:
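A sketch of the optimizer version, with the before/after check (lr and n_epochs as before):

from torch import optim

torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
print(a, b)  # before

optimizer = optim.SGD([a, b], lr=lr)

for epoch in range(n_epochs):
    yhat = a + b * x_train_tensor
    error = y_train_tensor - yhat
    loss = (error ** 2).mean()

    loss.backward()
    optimizer.step()        # updates the parameters
    optimizer.zero_grad()   # zeroes all gradients handled by the optimizer

print(a, b)  # after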

Cool! We’ve optimized the optimization process :-) What’s left?

We now tackle the loss computation. As expected, PyTorch got us covered once again. There are many loss functions to choose from, depending on the task at hand. Since ours is a regression, we are using the Mean Square Error (MSE) loss.

We then use the created loss function later, in the training loop, to compute the loss given our predictions and our labels.

Our code looks like this now:
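(Reconstructed as a sketch; nn.MSELoss replaces the manual loss computation:)

import torch.nn as nn

torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

loss_fn = nn.MSELoss(reduction='mean')
optimizer = optim.SGD([a, b], lr=lr)

for epoch in range(n_epochs):
    yhat = a + b * x_train_tensor
    loss = loss_fn(yhat, y_train_tensor)   # no more manual loss computation

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(a, b)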

At this point, there’s only one piece of code left to change: the predictions. It is then time to introduce PyTorch’s way of implementing a…

In PyTorch, a model is represented by a regular Python class that inherits from the Module class.

The most fundamental methods it needs to implement are:

- __init__(self): it defines the parts that make up the model (in our case, two parameters, a and b).
- forward(self, x): it performs the actual computation, that is, it outputs a prediction, given the input x.

Let’s build a proper (yet simple) model for our regression task. It should look like this:
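A sketch of the model class:

import torch.nn as nn

class ManualLinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        # wrapping the tensors in Parameter tells PyTorch they belong to this model
        self.a = nn.Parameter(torch.randn(1, requires_grad=True, dtype=torch.float))
        self.b = nn.Parameter(torch.randn(1, requires_grad=True, dtype=torch.float))

    def forward(self, x):
        # computes the predictions
        return self.a + self.b * x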

In the __init__ method, we define our two parameters, a and b, using the Parameter() class, to tell PyTorch these tensors should be considered parameters of the model they are an attribute of.

Why should we care about that? By doing so, we can use our model’s parameters() method to retrieve an iterator over all model’s parameters, even those parameters of nested models, that we can use to feed our optimizer (instead of building a list of parameters ourselves!).

Moreover, we can get the current values for all parameters using our model’s state_dict() method.

We can use all these handy methods to change our code, which should be looking like this:
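(Reconstructed as a sketch, reusing lr and n_epochs:)

torch.manual_seed(42)
model = ManualLinearRegression().to(device)  # never forget to send the model to the device

loss_fn = nn.MSELoss(reduction='mean')
optimizer = optim.SGD(model.parameters(), lr=lr)  # parameters() feeds the optimizer

for epoch in range(n_epochs):
    model.train()                       # What is this?!?
    yhat = model(x_train_tensor)
    loss = loss_fn(yhat, y_train_tensor)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(model.state_dict())               # current values of all parameters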

Now, the printed statements will look like this — final values for parameters a and b are still the same, so everything is ok :-)

I hope you noticed one particular statement in the code, to which I assigned a comment “What is this?!?” — model.train(). In PyTorch, models have a train() method which, somewhat misleadingly, does not perform a training step; it only sets the model to training mode (so mechanisms like dropout behave correctly during training).

In our model, we manually created two parameters to perform a linear regression. Let’s use PyTorch’s Linear model as an attribute of our own, thus creating a nested model.

Even though this clearly is a contrived example, as we are pretty much wrapping the underlying model without adding anything useful (or, at all!) to it, it illustrates well the concept.

In the __init__ method, we created an attribute that contains our nested Linear model.

In the forward() method, we call the nested model itself to perform the forward pass (notice, we are not calling self.linear.forward(x)!).
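A sketch of the nested model:

class LayerLinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        # instead of custom parameters, we use a nested Linear layer
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        # calling the module itself, not self.linear.forward(x)
        return self.linear(x)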

Now, if we call the parameters() method of this model, PyTorch will figure the parameters of its attributes in a recursive way. You can try it yourself using something like [*LayerLinearRegression().parameters()] to get a list of all parameters. You can also add new Linear attributes and, even if you don’t use them at all in the forward pass, they will still be listed under parameters().

Our model was simple enough… You may be thinking: “why even bother to build a class for it?!” Well, you have a point…

For straightforward models, that use run-of-the-mill layers, where the output of a layer is sequentially fed as an input to the next, we can use a, er… Sequential model :-)

In our case, we would build a Sequential model with a single argument, that is, the Linear layer we used to train our linear regression. The model would look like this:
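For our linear regression, that is a one-liner:

model = nn.Sequential(nn.Linear(1, 1)).to(device)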

Simple enough, right?

So far, we’ve defined an optimizer, a loss function and a model. Scroll up a bit and take a quick look at the code inside the loop. Would it change if we were using a different optimizer, or loss, or even model? If not, how can we make it more generic?

Well, I guess we could say all these lines of code perform a training step, given those three elements (optimizer, loss and model), the features and the labels.

So, how about writing a function that takes those three elements and returns another function that performs a training step, taking a set of features and labels as arguments and returning the corresponding loss?

Then we can use this general-purpose function to build a train_step() function to be called inside our training loop. Now our code should look like this… see how tiny the training loop is now?
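A sketch of the higher-order function and the resulting loop (model, loss_fn and optimizer as in the previous sketches):

model = nn.Sequential(nn.Linear(1, 1)).to(device)
loss_fn = nn.MSELoss(reduction='mean')
optimizer = optim.SGD(model.parameters(), lr=lr)

def make_train_step(model, loss_fn, optimizer):
    # builds a function that performs a step in the training loop
    def train_step(x, y):
        model.train()
        yhat = model(x)
        loss = loss_fn(yhat, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()
    return train_step

train_step = make_train_step(model, loss_fn, optimizer)
losses = []

for epoch in range(n_epochs):
    loss = train_step(x_train_tensor, y_train_tensor)
    losses.append(loss)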

Let’s give our training loop a rest and focus on our data for a while… so far, we’ve simply used our Numpy arrays turned PyTorch tensors. But we can do better, we can build a…

In PyTorch, a dataset is represented by a regular Python class that inherits from the Dataset class. You can think of it as a kind of a Python list of tuples, each tuple corresponding to one point (features, label).

The most fundamental methods it needs to implement are:

- __init__(self): it takes whatever arguments are needed to build the dataset (in our case, two tensors).
- __getitem__(self, index): it allows the dataset to be indexed, returning one (features, label) tuple.
- __len__(self): it returns the size of the whole dataset.

Let’s build a simple custom dataset that takes two tensors as arguments: one for the features, one for the labels. For any given index, our dataset class will return the corresponding slice of each of those tensors. It should look like this:
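A sketch of the custom dataset (note the tensors stay on the CPU here; mini-batches are sent to the device only when needed):

from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, x_tensor, y_tensor):
        self.x = x_tensor
        self.y = y_tensor

    def __getitem__(self, index):
        return (self.x[index], self.y[index])

    def __len__(self):
        return len(self.x)

x_train_tensor = torch.from_numpy(x_train).float()
y_train_tensor = torch.from_numpy(y_train).float()
train_data = CustomDataset(x_train_tensor, y_train_tensor)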

Once again, you may be thinking “why go through all this trouble to wrap a couple of tensors in a class?”. And, once again, you do have a point… if a dataset is nothing else but a couple of tensors, we can use PyTorch’s TensorDataset class, which will do pretty much what we did in our custom dataset above.
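That is:

from torch.utils.data import TensorDataset

train_data = TensorDataset(x_train_tensor, y_train_tensor)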

OK, fine, but then again, why are we building a dataset anyway? We’re doing it because we want to use a…

Until now, we have used the whole training data at every training step. It has been batch gradient descent all along. This is fine for our ridiculously small dataset, sure, but if we want to go serious about all this, we must use mini-batch gradient descent. Thus, we need mini-batches. Thus, we need to slice our dataset accordingly. Do you want to do it manually?! Me neither!

So we use PyTorch’s DataLoader class for this job. We tell it which dataset to use (the one we just built in the previous section), the desired mini-batch size and if we’d like to shuffle it or not. That’s it!
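A sketch (the mini-batch size of 16 is an arbitrary choice):

from torch.utils.data import DataLoader

train_loader = DataLoader(dataset=train_data, batch_size=16, shuffle=True)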

Our loader will behave like an iterator, so we can loop over it and fetch a different mini-batch every time.

To retrieve a sample mini-batch, one can simply run the command below — it will return a list containing two tensors, one for the features, another one for the labels.
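next(iter(train_loader))  # returns [features_batch, labels_batch]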

How does this change our training loop? Let’s check it out!
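A sketch of the new loop:

for epoch in range(n_epochs):
    for x_batch, y_batch in train_loader:
        # the dataset "lives" on the CPU, so do our mini-batches;
        # we send only the current mini-batch to the device
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        loss = train_step(x_batch, y_batch)
        losses.append(loss)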

Two things are different now: not only do we have an inner loop to load each and every mini-batch from our DataLoader but, more importantly, we are now sending only one mini-batch at a time to the device.

So far, we’ve focused on the training data only. We built a dataset and a data loader for it. We could do the same for the validation data, using the split we performed at the beginning of this post… or we could use random_split instead.

PyTorch’s random_split() method is an easy and familiar way of performing a training-validation split. Just keep in mind that, in our example, we need to apply it to the whole dataset (not the training dataset we built two sections ago).

Then, for each subset of data, we build a corresponding DataLoader, so our code looks like this:
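A sketch of the split and the two loaders:

from torch.utils.data import random_split

# split the WHOLE dataset, not just the training tensors
x_tensor = torch.from_numpy(x).float()
y_tensor = torch.from_numpy(y).float()
dataset = TensorDataset(x_tensor, y_tensor)

train_dataset, val_dataset = random_split(dataset, [80, 20])

train_loader = DataLoader(dataset=train_dataset, batch_size=16)
val_loader = DataLoader(dataset=val_dataset, batch_size=20)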

Now we have a data loader for our validation set, so, it makes sense to use it for the…

This is the last part of our journey — we need to change the training loop to include the evaluation of our model, that is, computing the validation loss. The first step is to include another inner loop to handle the mini-batches that come from the validation loader, sending them to the same device as our model. Next, we make predictions using our model and compute the corresponding loss.

That’s pretty much it, but there are two small, yet important, things to consider (both handled in the sketch below):

- torch.no_grad(): there is no need to compute gradients during validation, so we wrap the validation loop with it.
- eval(): the counterpart of train(), it sets the model to evaluation mode.

Now, our training loop should look like this:
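A sketch of the final loop:

losses = []
val_losses = []

for epoch in range(n_epochs):
    for x_batch, y_batch in train_loader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        loss = train_step(x_batch, y_batch)
        losses.append(loss)

    with torch.no_grad():                  # no gradients in validation!
        for x_val, y_val in val_loader:
            x_val = x_val.to(device)
            y_val = y_val.to(device)

            model.eval()                   # sets the model to evaluation mode
            yhat = model(x_val)
            val_loss = loss_fn(yhat, y_val)
            val_losses.append(val_loss.item())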

Is there anything else we can improve or change? Sure, there is always something else to add to your model — using a learning rate scheduler, for instance. But this post is already waaaay too long, so I will stop right here.

“Where is the full working code with all bells and whistles?”, you ask? You can find it here.

Although this post was much longer than I anticipated when I started writing it, I wouldn’t make it any different — I believe it has most of the necessary steps one needs to go through in order to learn, in a structured and incremental way, how to develop Deep Learning models using PyTorch.

[3]
Saptrishi Chowdhry
STENOTYPE OPERATOR
Answer # 4 #

The second point might be an uncommon opinion: if I Google "tracing vs scripting", the first article recommends scripting as default. But tracing has many advantages. In fact, by the time I left, "tracing as default, scripting only when necessary" was the strategy by which all detection & segmentation models in Facebook/Meta products were deployed.

Why is tracing better? TL;DR: (i) it will not damage the code quality; (ii) its main limitations can be addressed by mixing in scripting.

We start by disambiguating some common terminologies:

- Tracing (torch.jit.trace): runs the model on example inputs and records the operators actually executed, producing a graph.
- Scripting (torch.jit.script): compiles the model's Python source code into a graph, using the scripting compiler.

If anyone says "we'll make Python better by writing a compiler for it", you should immediately be alarmed and know that this is extremely difficult. Python is too big and too dynamic. A compiler can only support a subset of its syntax features and builtins, at best -- the scripting compiler in PyTorch is no exception.

What subset of Python does this compiler support? A rough answer is: the compiler has good support for the most basic syntax, but medium to no support for anything more complicated (classes, builtins like range and zip, dynamic types, etc.). But there is no clear answer: even the developers of the compiler usually need to run the code to see if it can be compiled or not.

The incomplete Python compiler limits how users can write code. Though there isn't a clear list of constraints, I can tell from my experience what impact they have had on large projects: code quality is the cost of scriptability.

To make their code scriptable / compilable by the scripting compiler, most projects choose to stay on the "safe side" to only use basic syntax of Python: no/few custom structures, no builtins, no inheritance, no Union, no **kwargs, no lambda, no dynamic types, etc.

This is because these "advanced" compiler features are either not supported at all, or with "partial support" which is not robust enough: they may work in some cases but fail in others. And because there is no clear spec of what is supported, users are unable to reason about or workaround the failures. Therefore, eventually users move to and stay on the safe side.

The terrible consequence is that: developers stop making abstractions / exploring useful language features due to concerns in scriptability.

A related hack that many projects do is to rewrite part of the code for scripting: create a separate, inference-only forward codepath that makes the compiler happy. This also makes the project harder to maintain.

Detectron2 supports scripting, but the story was a bit different: it did not go downhill in code quality which we value a lot in research. Instead, with some creativity and direct support from PyTorch team (and some volunteered help from Alibaba engineers), we managed to make most models scriptable without removing any abstractions.

However, it is not an easy task: we had to add dozens of syntax fixes to the compiler, find creative workarounds, and develop some hacky patches in detectron2 that are in this file (which honestly could affect maintainability in the long term). I would not recommend other large projects to aim for "scriptability without losing abstractions" unless they are also closely supported by PyTorch team.

If you think "scripting seems to work for my project" so let's embrace it, I might advise against it for the following reasons, based on my past experiences with a few projects that support scripting:

Below is a complaint in PyTorch issues. The issue itself is just one small papercut of scripting, but similar complaints were heard many times. The status-quo is: scripting forces you to write ugly code, so only use it when necessary.

What it takes to make a model traceable is very clear, and has a much smaller impact on code health.

That's all it takes for traceability. Most importantly, any Python syntax is allowed in the model implementation, because tracing does not care about syntax at all.
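For example, tracing needs nothing but example inputs (TinyModel is a hypothetical stand-in for any model):

import torch
import torch.nn as nn

class TinyModel(nn.Module):
    # tracing records the ops executed here; the syntax inside is unconstrained
    def forward(self, x):
        return x.relu() + 1

model = TinyModel().eval()
example_input = torch.randn(1, 3, 224, 224)   # any representative input
traced = torch.jit.trace(model, (example_input,))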

Just being "traceable" is not sufficient. The biggest problem with tracing is that it may not generalize to other inputs. This problem happens in cases such as control flow that depends on the input data, and values (shapes, devices) that get captured as constants during tracing.
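A minimal sketch of the control-flow case (the branch taken at trace time is baked into the graph):

import torch

def f(x):
    if x.sum() > 0:         # Python bool taken from a tensor: decided once, at trace time
        return x * 2
    return x - 1

traced = torch.jit.trace(f, torch.ones(3))   # traces the `x * 2` branch (with a TracerWarning)
print(traced(-torch.ones(3)))                # wrong: still multiplies by 2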

The above problems are annoying and often silent (warnings, but no errors), but they can be successfully addressed by good practice and tools: paying attention to TracerWarnings, unit-testing the traced model against the original on differently-shaped inputs, and scripting the genuinely dynamic parts (more on mixing below).

Tracing and scripting both have their own problems, and the best solution is usually to mix them together. This gives us the best of both worlds.

To minimize the negative impact on code quality, we should use tracing for the majority of logic, and use scripting only when necessary.
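A sketch of one common way to mix them: script the data-dependent pieces and trace the rest (PyTorch preserves calls to @torch.jit.script functions inside a trace; the module here is hypothetical):

import torch
import torch.nn as nn

@torch.jit.script
def clip_boxes(boxes: torch.Tensor, size: int) -> torch.Tensor:
    # data-dependent control flow: scripted so it generalizes under tracing
    if boxes.numel() == 0:
        return boxes
    return boxes.clamp(min=0, max=size)

class Head(nn.Module):
    # most of the logic stays plain Python and is traced
    def forward(self, x):
        return clip_boxes(x, 100)

traced = torch.jit.trace(Head().eval(), (torch.randn(4, 4),))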

If a model is both traceable and scriptable, tracing always generates same or simpler graph (therefore likely faster).

Why? Because scripting tries to faithfully represent your Python code, even some of it are unnecessary. For example: it is not always smart enough to realize that some loops or data structures in the Python code are actually static and can be removed:
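A sketch of that effect (a loop over a static list; tracing unrolls it into a few adds, scripting keeps the list and the loop in the graph):

import torch
import torch.nn as nn

class Mod(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = x
        for k in [1, 2, 3]:   # this loop is actually static
            out = out + k
        return out

x = torch.randn(3)
print(torch.jit.trace(Mod().eval(), (x,)).graph)  # unrolled: no loop left
print(torch.jit.script(Mod()).graph)              # the loop survives compilation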

This example is very simple, so it actually has workarounds for scripting (use tuple instead of list), or the loop might get optimized in a later optimization pass. But the point is: the graph compiler is not always smart enough. For complicated models, scripting might generate a graph with unnecessary complexity that's hard to optimize.

Tracing has clear limitations: I spent most of this article talking about the limitations of tracing and how to fix them. I actually think this is the advantage of tracing: it has clear limitations (and solutions), so you can reason about whether it works.

On the contrary, scripting is more like a black box: no one knows if it works before trying. I didn't mention a single trick about how to fix scripting: there are many of them, but it's not worth your time to probe and fix a black box.

Tracing has small blast radius: Both tracing and scripting affect how code can be written, but tracing has a much smaller blast radius and causes much less damage: it only constrains the inputs and outputs of the traced region, while the implementation in between can remain arbitrary Python.

On the other hand, scripting has an impact on every line of code it compiles, in every function and class the scripted code touches, all the way down the call stack.

Having a large blast radius is why scripting can do great harm to code quality.

[3]
Sheeri Garfein
Telephone Maintainer
Answer # 5 #

Any directory without a __init__.py file present in it, located on your module search path, will be treated as a namespace, provided no other Python modules or packages by that name are found anywhere else along the search path.

This means that if torch was installed for your Python binary, it doesn't matter if there is a local torch directory:
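A sketch reproducing the diagnostic (it assumes a bare local torch/ directory with no __init__.py, plus a regular torch package at additional_path/torch/ further along the path):

import os.path as p
import sys

# every directory named "torch" that exists along the search path, in order
print(*(t for t in (p.join(e, 'torch') for e in sys.path) if p.exists(t)), sep='\n')

import torch
print(torch.__file__)   # the regular package wins over the bare (namespace) directory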

The above shows that sys.path lists the local torch directory first, followed by additional_path/torch, yet it's the latter that is loaded as the torch module when you try to import it. That's because Python gives priority to regular top-level modules and packages before falling back to a namespace package.

You need to install torch correctly for your current Python binary; see the project homepage. When using pip you may want to use the Python binary with the -m switch instead:

So replace the pip3 the homepage instructions use with python3.5 -m pip; python3.5 can also be the full path to your Python binary.
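For example (check the homepage for the exact command and URL for your platform):

python3.5 -m pip install torch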

Do use the correct download.pytorch.org URL for the latest version.

You don't have to move the directory aside, but if you do want to and don't know where it is located, use print(torch.__path__) as I've shown above.

Again, note that if you do have an __init__.py file in a local torch directory, it becomes a regular package and it'll mask packages installed by pip into the normal site-packages location. If you have such a package, or a local torch.py single-file module, you need to rename those. The diagnostic information looks different in that case:

Note the differences: a namespace package shows up as <module 'torch' (namespace)>, while a regular package shows up as <module 'torch' from '.../torch/__init__.py'>, and a plain module as <module 'torch' from '.../torch.py'>.

[1]
Gauri Acharya
Props and Lighting Technicians