From 5454bac9179612228bb385b5be8034832daa90d0 Mon Sep 17 00:00:00 2001
From: ArnoStrouwen <arno.strouwen@telenet.be>
Date: Mon, 9 Jan 2023 11:50:43 +0100
Subject: [PATCH] [skip ci] LanguageTool

---
 .gitignore                                    |  1 +
 docs/src/Benchmark.md                         | 12 +++---
 docs/src/examples/dae/physical_constraints.md | 10 ++---
 docs/src/examples/neural_ode/minibatch.md     |  8 ++--
 docs/src/examples/neural_ode/neural_gde.md    | 14 +++----
 .../examples/neural_ode/neural_ode_flux.md    |  8 ++--
 docs/src/examples/neural_ode/simplechains.md  | 12 +++---
 docs/src/examples/ode/exogenous_input.md      |  4 +-
 .../examples/ode/prediction_error_method.md   |  8 ++--
 .../src/examples/ode/second_order_adjoints.md |  2 +-
 docs/src/examples/ode/second_order_neural.md  |  2 +-
 docs/src/examples/pde/pde_constrained.md      | 12 +++---
 docs/src/examples/sde/SDE_control.md          |  6 +--
 docs/src/examples/sde/optimization_sde.md     |  6 +--
 docs/src/getting_started.md                   | 39 ++++++++++---------
 docs/src/index.md                             | 22 +++++------
 docs/src/sensitivity_math.md                  |  8 ++--
 .../adjoint_continuous_functional.md          |  8 ++--
 docs/src/tutorials/chaotic_ode.md             |  8 ++--
 docs/src/tutorials/data_parallel.md           | 18 ++++-----
 docs/src/tutorials/direct_sensitivity.md      |  6 +--
 .../src/tutorials/parameter_estimation_ode.md |  8 ++--
 .../src/tutorials/training_tips/divergence.md |  4 +-
 .../tutorials/training_tips/local_minima.md   | 14 +++----
 .../tutorials/training_tips/multiple_nn.md    |  2 +-
 25 files changed, 122 insertions(+), 120 deletions(-)

diff --git a/.gitignore b/.gitignore
index 3f02ca741..f3fc3a6b4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,3 +2,4 @@
 *.jl.*.cov
 *.jl.mem
 Manifest.toml
+/docs/build/
\ No newline at end of file
diff --git a/docs/src/Benchmark.md b/docs/src/Benchmark.md
index d7725d403..a78b675b3 100644
--- a/docs/src/Benchmark.md
+++ b/docs/src/Benchmark.md
@@ -2,13 +2,13 @@
 
 ## Note on benchmarking and getting the best performance out of the SciML stack's adjoints
 
-From our [recent papers](https://arxiv.org/abs/1812.01892) it's clear that `EnzymeVJP` is the fastest,
-especially when the program is setup to be fully non-allocating mutating functions. Thus for all benchmarking,
+From our [recent papers](https://arxiv.org/abs/1812.01892), it's clear that `EnzymeVJP` is the fastest,
+especially when the program is set up to be fully non-allocating mutating functions. Thus for all benchmarking,
 especially with PDEs, this should be done. Neural network libraries don't make use of mutation effectively
 [except for SimpleChains.jl](https://julialang.org/blog/2022/04/simple-chains/), so we recommend creating a
 neural ODE / universal ODE with `ZygoteVJP` and Flux first, but then check the correctness by moving the
 implementation over to SimpleChains and if possible `EnzymeVJP`. This can be an order of magnitude improvement
-(or more) in many situations over all of the previous benchmarks using Zygote and Flux, and thus it's
+(or more) in many situations over all the previous benchmarks using Zygote and Flux, and thus it's
 highly recommended in scenarios that require performance.
 
 ## Vs Torchdiffeq 1 million and less ODEs
@@ -23,10 +23,10 @@ A training benchmark using the spiral ODE from the original neural ODE paper
 
 ## Vs torchsde on small SDEs
 
-Using the code from torchsde's README we demonstrated a [>70,000x performance
+Using the code from torchsde's README, we demonstrated a [>70,000x performance
 advantage over torchsde](https://gist.github.com/ChrisRackauckas/6a03e7b151c86b32d74b41af54d495c6).
-Further benchmarking is planned but was found to be computationally infeasible
-for the time being.
+Further benchmarking is planned, but was found to be computationally infeasible
+at this time.
 
 ## A bunch of adjoint choices on neural ODEs
 
diff --git a/docs/src/examples/dae/physical_constraints.md b/docs/src/examples/dae/physical_constraints.md
index d9177b0f3..48a19ce22 100644
--- a/docs/src/examples/dae/physical_constraints.md
+++ b/docs/src/examples/dae/physical_constraints.md
@@ -4,7 +4,7 @@ As shown in the [stiff ODE tutorial](https://docs.sciml.ai/SciMLTutorialsOutput/
 differential-algebraic equations (DAEs) can be used to impose physical
 constraints. One way to define a DAE is through an ODE with a singular mass
 matrix. For example, if we make `Mu' = f(u)` where the last row of `M` is all
-zeros, then we have a constraint defined by the right hand side. Using
+zeros, then we have a constraint defined by the right-hand side. Using
 `NeuralODEMM`, we can use this to define a neural ODE where the sum of all 3
 terms must add to one. An example of this is as follows:
 
@@ -81,7 +81,7 @@ rng = Random.default_rng()
 
 ### Differential Equation
 
-First, we define our differential equations as a highly stiff problem which makes the
+First, we define our differential equations as a highly stiff problem, which makes the
 fitting difficult.
 
 ```@example dae2
@@ -118,7 +118,7 @@ all zeros)
 
 ### ODE Function, Problem and Solution
 
-We define and solve our ODE problem to generate the "labeled" data which will be used to
+We define and solve our ODE problem to generate the "labeled" data, which will be used to
 train our Neural Network.
 
 ```@example dae2
@@ -127,7 +127,7 @@ prob_stiff = ODEProblem(stiff_func, u₀, tspan, p)
 sol_stiff = solve(prob_stiff, Rodas5(), saveat = 0.1)
 ```
 
-Because this is a DAE we need to make sure to use a **compatible solver**.
+Because this is a DAE, we need to make sure to use a **compatible solver**.
 `Rodas5` works well for this example.
 
 ### Neural Network Layers
@@ -163,7 +163,7 @@ end
 
 ### Train Parameters
 
-Training our network requires a **loss function**, an **optimizer** and a
+Training our network requires a **loss function**, an **optimizer**, and a
 **callback function** to display the progress.
 
 #### Loss
diff --git a/docs/src/examples/neural_ode/minibatch.md b/docs/src/examples/neural_ode/minibatch.md
index 36ab6d507..982583d7b 100644
--- a/docs/src/examples/neural_ode/minibatch.md
+++ b/docs/src/examples/neural_ode/minibatch.md
@@ -88,9 +88,9 @@ xlabel!("Time")
 ylabel!("Temp") 
 ```
 
-When training a neural network we need to find the gradient with respect to our data set. There are three main ways to partition our data when using a training algorithm like gradient descent: stochastic, batching and mini-batching. Stochastic gradient descent trains on a single random data point each epoch. This allows for the neural network to better converge to the global minimum even on noisy data but is computationally inefficient. Batch gradient descent trains on the whole data set each epoch and while computationally efficient is prone to converging to local minima. Mini-batching combines both of these advantages and by training on a small random "mini-batch" of the data each epoch can converge to the global minimum while remaining more computationally efficient than stochastic descent. Typically we do this by randomly selecting subsets of the data each epoch and use this subset to train on. We can also pre-batch the data by creating an iterator holding these randomly selected batches before beginning to train. The proper size for the batch can be determined experimentally. Let us see how to do this with Julia. 
+When training a neural network, we need to find the gradient with respect to our data set. There are three main ways to partition our data when using a training algorithm like gradient descent: stochastic, batch, and mini-batch. Stochastic gradient descent trains on a single random data point each epoch. This allows the neural network to better converge to the global minimum even on noisy data, but it is computationally inefficient. Batch gradient descent trains on the whole data set each epoch and, while computationally efficient, is prone to converging to local minima. Mini-batching combines both advantages: by training on a small random "mini-batch" of the data each epoch, it can converge to the global minimum while remaining more computationally efficient than stochastic descent. Typically, we do this by randomly selecting a subset of the data each epoch and training on that subset. We can also pre-batch the data by creating an iterator holding these randomly selected batches before beginning to train. The proper size for the batch can be determined experimentally. Let us see how to do this with Julia.
 
-For this example we will use a very simple ordinary differential equation, newtons law of cooling. We can represent this in Julia like so. 
+For this example, we will use a very simple ordinary differential equation, Newton's law of cooling. We can represent this in Julia like so.
 
 ```@example minibatch
 using DifferentialEquations, Flux, Random, Plots
@@ -156,7 +156,7 @@ for (x, y) in train_loader
 end
 ```
 
-Now we train the neural network with a user defined call back function to display loss and the graphs with a maximum of 300 epochs. 
+Now we train the neural network with a user-defined callback function to display the loss and the graphs, with a maximum of 300 epochs.
 
 ```@example minibatch
 numEpochs = 300
@@ -176,7 +176,7 @@ opt=ADAM(0.05)
 Flux.train!(loss_adjoint, Flux.params(θ), ncycle(train_loader,numEpochs), opt, cb=Flux.throttle(cb, 10))
 ```
 
-Finally we can see how well our trained network will generalize to new initial conditions. 
+Finally, we can see how well our trained network will generalize to new initial conditions. 
 
 ```@example minibatch
 starting_temp=collect(10:30:250)
diff --git a/docs/src/examples/neural_ode/neural_gde.md b/docs/src/examples/neural_ode/neural_gde.md
index 3ba28d480..fccc086eb 100644
--- a/docs/src/examples/neural_ode/neural_gde.md
+++ b/docs/src/examples/neural_ode/neural_gde.md
@@ -2,7 +2,7 @@
 
 This tutorial has been adapted from [here](https://github.com/CarloLucibello/GraphNeuralNetworks.jl/blob/master/examples/neural_ode_cora.jl).
 
-In this tutorial we will use Graph Differential Equations (GDEs) to perform classification on the [CORA Dataset](https://relational.fit.cvut.cz/dataset/CORA). We shall be using the Graph Neural Networks primitives from the package [GraphNeuralNetworks](https://github.com/CarloLucibello/GraphNeuralNetworks.jl).
+In this tutorial, we will use Graph Differential Equations (GDEs) to perform classification on the [CORA Dataset](https://relational.fit.cvut.cz/dataset/CORA). We shall be using the Graph Neural Networks primitives from the package [GraphNeuralNetworks](https://github.com/CarloLucibello/GraphNeuralNetworks.jl).
 
 ```julia
 # Load the packages
@@ -184,7 +184,7 @@ Ã = normalized_adjacency(g, add_self_loops=true) |> device
 ```
 ### Training Data
 
-GNNs operate on an entire graph, so we can't do any sort of minibatching here. We predict the entire dataset but train the model in a semi-supervised learning fashion.
+GNNs operate on an entire graph, so we can't do any sort of minibatching here. We predict the entire dataset, but train the model in a semi-supervised learning fashion.
 ```julia
 (; train_mask, val_mask, test_mask) = g.ndata
 ytrain = y[:,train_mask]
@@ -202,7 +202,7 @@ epochs = 20
 ```
 ## Define the Graph Neural Network
 
-Here we define a type of graph neural networks called `GCNConv`. We use the name `ExplicitGCNConv` to avoid naming conflicts with `GraphNeuralNetworks`. For more informations on defining a layer with `Lux`, please consult to the [doc](http://lux.csail.mit.edu/dev/introduction/overview/#AbstractExplicitLayer-API).
+Here, we define a type of graph neural network layer called `GCNConv`. We use the name `ExplicitGCNConv` to avoid naming conflicts with `GraphNeuralNetworks`. For more information on defining a layer with `Lux`, please consult the [documentation](http://lux.csail.mit.edu/dev/introduction/overview/#AbstractExplicitLayer-API).
 
 
 ```julia
@@ -240,7 +240,7 @@ end
 
 ## Neural Graph Ordinary Differential Equations
 
-Let us now define the final model. We will use two GNN layers for approximating the gradients for the neural ODE. We use one additional `GCNConv` layer to project the data to a latent space and the a `Dense` layer to project it from the latent space to the predictions. Finally a softmax layer gives us the probability of the input belonging to each target category.
+Let us now define the final model. We will use two GNN layers for approximating the gradients for the neural ODE. We use one additional `GCNConv` layer to project the data to a latent space and a `Dense` layer to project it from the latent space to the predictions. Finally, a softmax layer gives us the probability of the input belonging to each target category.
 
 ```julia
 function diffeqsol_to_array(x::ODESolution{T, N, <:AbstractVector{<:CuArray}}) where {T, N}
@@ -264,7 +264,7 @@ model = Chain(ExplicitGCNConv(Ã, nin => nhidden, relu),
 
 ### Loss Function and Accuracy
 
-We shall be using the standard categorical crossentropy loss function which is used for multiclass classification tasks.
+We shall be using the standard categorical crossentropy loss function, which is used for multiclass classification tasks.
 
 ```julia
 logitcrossentropy(ŷ, y) = mean(-sum(y .* logsoftmax(ŷ); dims=1))
@@ -283,7 +283,7 @@ end
 ```
 
 ### Setup Model
-We need to manually set up our mode with `Lux`, and convert the paramters to `ComponentArray` so that they can work well with sensitivity algorithms.
+We need to manually set up our model with `Lux` and convert the parameters to a `ComponentArray` so that they work well with the sensitivity algorithms.
 ```julia
 rng = Random.default_rng()
 Random.seed!(rng, 0)
@@ -294,7 +294,7 @@ st = st |> device
 ```
 ### Optimizer
 
-For this task we will be using the `ADAM` optimizer with a learning rate of `0.01`.
+For this task, we will be using the `ADAM` optimizer with a learning rate of `0.01`.
 
 ```julia
 opt = Optimisers.Adam(0.01f0)
diff --git a/docs/src/examples/neural_ode/neural_ode_flux.md b/docs/src/examples/neural_ode/neural_ode_flux.md
index 370aa569c..517746b7b 100644
--- a/docs/src/examples/neural_ode/neural_ode_flux.md
+++ b/docs/src/examples/neural_ode/neural_ode_flux.md
@@ -1,6 +1,6 @@
 # Neural Ordinary Differential Equations with Flux
 
-All of the tools of SciMLSensitivity.jl can be used with Flux.jl. A lot of the examples
+All the tools of SciMLSensitivity.jl can be used with Flux.jl. A lot of the examples
 have been written to use `FastChain` and `sciml_train`, but in all cases this
 can be changed to the `Chain` and `Flux.train!` workflow.
 
@@ -74,10 +74,10 @@ p,re = Flux.destructure(chain)
 
 returns `p` which is the vector of parameters for the chain and `re` which is
 a function `re(p)` that reconstructs the neural network with new parameters
-`p`. Using this function we can thus build our neural differential equations in
+`p`. Using this function, we can thus build our neural differential equations in
 an explicit parameter style.
 
-Let's use this to build and train a neural ODE from scratch. In this example we will
+Let's use this to build and train a neural ODE from scratch. In this example, we will
 optimize both the neural network parameters `p` and the input initial condition `u0`.
 Notice that Optimization.jl works on a vector input, so we have to concatenate `u0`
 and `p` and then in the loss function split to the pieces.
@@ -149,7 +149,7 @@ result_neuralode2 = Optimization.solve(optprob2,
 ```
 
 Notice that the advantage of this format is that we can use Optim's optimizers, like
-`LBFGS` with a full `Chain` object for all of Flux's neural networks, like
+`LBFGS`, with a full `Chain` object for all of Flux's neural networks, like
 convolutional neural networks.
 
 ![](https://user-images.githubusercontent.com/1814174/51399500-1f4dd080-1b14-11e9-8c9d-144f93b6eac2.gif)
diff --git a/docs/src/examples/neural_ode/simplechains.md b/docs/src/examples/neural_ode/simplechains.md
index 133ed4cac..50e7580e6 100644
--- a/docs/src/examples/neural_ode/simplechains.md
+++ b/docs/src/examples/neural_ode/simplechains.md
@@ -1,10 +1,10 @@
 # Neural Ordinary Differential Equations with SimpleChains
 
-[SimpleChains](https://github.com/PumasAI/SimpleChains.jl) has demonstrated performance boosts of ~5x and ~30x when compared to other mainstream deep learning frameworks like Pytorch for the training and evaluation in the specific case of small neural networks. For the nitty-gritty details ,as well as, some SciML related videos around the need and applications of such a library we can refer to this [blogpost](https://julialang.org/blog/2022/04/simple-chains/).As for doing Scientific Machine Learning, how do we even begin with training neural ODEs with any generic deep learning library?
+[SimpleChains](https://github.com/PumasAI/SimpleChains.jl) has demonstrated performance boosts of ~5x and ~30x when compared to other mainstream deep learning frameworks like PyTorch for the training and evaluation in the specific case of small neural networks. For the nitty-gritty details, as well as some SciML-related videos on the need for and applications of such a library, we can refer to this [blog post](https://julialang.org/blog/2022/04/simple-chains/). As for doing Scientific Machine Learning, how do we even begin with training neural ODEs with any generic deep learning library?
 
 ## Training Data
 
-Firstly we'll need data for training the NeuralODE, which can be obtained by solving the ODE `u' = f(u,p,t)` numerically using the SciML ecosystem in Julia.
+First, we'll need data for training the NeuralODE, which can be obtained by solving the ODE `u' = f(u,p,t)` numerically using the SciML ecosystem in Julia.
 
 ```@example sc_neuralode
 using SimpleChains, StaticArrays, OrdinaryDiffEq, SciMLSensitivity, Optimization, OptimizationFlux, Plots
@@ -25,7 +25,7 @@ data = Array(solve(prob, Tsit5(), saveat = tsteps))
 
 ## Neural Network
 
-Next we setup a small neural network. It will be trained to output the derivative of the solution at each time step given the value of the solution at the previous time step and the parameters of the network. Thus, we are treating the neural network as a function `f(u,p,t)`. The difference is that instead of relying on knowing the exact equation for the ODE, we get to solve it only with the data.
+Next, we set up a small neural network. It will be trained to output the derivative of the solution at each time step, given the value of the solution at the previous time step and the parameters of the network. Thus, we are treating the neural network as a function `f(u,p,t)`. The difference is that instead of relying on knowing the exact equation for the ODE, we get to solve it only with the data.
 
 ```@example sc_neuralode
 sc = SimpleChain(
@@ -42,7 +42,7 @@ f(u,p,t) = sc(u,p)
 
 ## NeuralODE, Prediction and Loss
 
-Now instead of the function `trueODE(u,p,t)` in the first code block, we pass the neural network to the ODE solver. This is our NeuralODE. Now in order to train it we obtain predictions from the model and calculate the L2 loss against the data generated numerically previously.
+Now, instead of the function `trueODE(u,p,t)` in the first code block, we pass the neural network to the ODE solver. This is our NeuralODE. To train it, we obtain predictions from the model and calculate the L2 loss against the data generated numerically earlier.
 
 ```@example sc_neuralode
 prob_nn = ODEProblem(f, u0, tspan)
@@ -60,9 +60,9 @@ end
 
 ## Training
 
-The next step is to minimize the loss, so that the NeuralODE gets trained. But in order to be able to do that, we have to be able to backpropagate through the NeuralODE model. Here the backpropagation through the neural network is the easy part and we get that out of the box with any deep learning package(although not as fast as SimpleChains for the small nn case here). But we have to find a way to first propagate the sensitivities of the loss back, first through the ODE solver and then to the neural network.
+The next step is to minimize the loss, so that the NeuralODE gets trained. But in order to be able to do that, we have to be able to backpropagate through the NeuralODE model. Here the backpropagation through the neural network is the easy part, and we get that out of the box with any deep learning package (although not as fast as SimpleChains for the small neural network case here). But we have to find a way to propagate the sensitivities of the loss back, first through the ODE solver and then to the neural network.
 
-The adjoint of a neural ODE can be calculated through the various AD algorithms available in SciMLSensitivity.jl. But for working with [StaticArrays](https://docs.sciml.ai/StaticArrays/stable/) in SimpleChains.jl we require a special adjoint method as StaticArrays do not allow any mutation. All the adjoint methods make heavy use of in-place mutation to be performant with the heap allocated normal arrays. For our statically sized, stack allocated StaticArrays, in order to be able to compute the ODE adjoint we need to do everything out of place. Hence we have specifically used `QuadratureAdjoint(autojacvec=ZygoteVJP())` adjoint algorithm in the solve call inside `predict_neuralode(p)` which computes everything out-of-place when u0 is a StaticArray. Hence we can move forward with the training of the NeuralODE
+The adjoint of a neural ODE can be calculated through the various AD algorithms available in SciMLSensitivity.jl. But working with [StaticArrays](https://docs.sciml.ai/StaticArrays/stable/) in SimpleChains.jl requires a special adjoint method, as StaticArrays do not allow any mutation. All the adjoint methods make heavy use of in-place mutation to be performant with the heap-allocated normal arrays. For our statically sized, stack-allocated StaticArrays, we need to do everything out of place in order to compute the ODE adjoint. Hence, we have specifically used the `QuadratureAdjoint(autojacvec=ZygoteVJP())` adjoint algorithm in the solve call inside `predict_neuralode(p)`, which computes everything out-of-place when `u0` is a StaticArray. With that in place, we can move forward with the training of the NeuralODE:
 
 ```@example sc_neuralode
 callback = function (p, l, pred; doplot = true)
diff --git a/docs/src/examples/ode/exogenous_input.md b/docs/src/examples/ode/exogenous_input.md
index 5f2401aa1..2bbdc4805 100644
--- a/docs/src/examples/ode/exogenous_input.md
+++ b/docs/src/examples/ode/exogenous_input.md
@@ -1,6 +1,6 @@
 # Handling Exogenous Input Signals
 
-The key to using exogeneous input signals is the same as in the rest of the
+The key to using exogenous input signals is the same as in the rest of the
 SciML universe: just use the function in the definition of the differential
 equation. For example, if it's a standard differential equation, you can
 use the form
@@ -30,7 +30,7 @@ which encloses an extra argument into `f` so that `_f` is now the interface-comp
 differential equation definition.
 
 Note that you can also learn what the exogenous equation is from data. For an
-example on how to do this, you can use the [Optimal Control Example](@ref optcontrol)
+example on how to do this, you can use the [Optimal Control Example](@ref optcontrol),
 which shows how to parameterize a `u(t)` by a universal function and learn that
 from data.
 
diff --git a/docs/src/examples/ode/prediction_error_method.md b/docs/src/examples/ode/prediction_error_method.md
index bfb402938..4316ab9c1 100644
--- a/docs/src/examples/ode/prediction_error_method.md
+++ b/docs/src/examples/ode/prediction_error_method.md
@@ -1,6 +1,6 @@
 # [Prediction error method (PEM)](@id pemethod)
 
-When identifying linear systems from noisy data, the prediction-error method [^Ljung] is close to a gold standard when it comes to the quality of the models it produces, but is also one of the computationally more expensive methods due to its reliance on iterative, gradient-based estimation. When we are identifying nonlinear models, we typically do not have the luxury of closed-form, non-iterative solutions, while PEM is easier to adopt to the nonlinear setting.[^Larsson]
+When identifying linear systems from noisy data, the prediction-error method [^Ljung] is close to a gold standard when it comes to the quality of the models it produces, but is also one of the computationally more expensive methods due to its reliance on iterative, gradient-based estimation. When we are identifying nonlinear models, we typically do not have the luxury of closed-form, non-iterative solutions, while PEM is easier to adapt to the nonlinear setting.[^Larsson]
 
 Fundamentally, PEM changes the problem from minimizing a loss based on the simulation performance, to minimizing a loss based on shorter-term predictions. There are several benefits of doing so, and this example will highlight two:
 - The loss is often easier to optimize.
@@ -13,7 +13,7 @@ We will start by illustrating a common problem with simulation-error minimizatio
 
 Another case that poses a problem for simulation-error estimation is when the system is unstable or chaotic. A small error in either the initial condition or the parameters may cause the simulation error to diverge and its gradient to become meaningless.
 
-In both of these examples, we may make use of measurements we have of the evolution of the system to prevent the simulation error from diverging. For instance, if we have measured the angle of the pendulum, we can make use of this measurement to adjust the angle during the simulation to make sure it stays close to the measured angle. Instead of performing a pure simulation, we instead say that we *predict* the state a while forward in time, given all the measurements up until the current time point. By minimizing this prediction rather than the pure simulation, we can often prevent the model error from diverging even though we have a poor initial guess. 
+In both of these examples, we may make use of measurements we have of the evolution of the system to prevent the simulation error from diverging. For instance, if we have measured the angle of the pendulum, we can make use of this measurement to adjust the angle during the simulation to make sure it stays close to the measured angle. Instead of performing a pure simulation, we say that we *predict* the state a while forward in time, given all the measurements up to the current time point. By minimizing this prediction error rather than the pure simulation error, we can often prevent the model error from diverging even though we have a poor initial guess.
 
 We start by defining a model of the pendulum. The model takes a parameter $L$ corresponding to the length of the pendulum. 
 
@@ -120,7 +120,7 @@ plot!(Ls, predlosses, lab="Prediction loss")
 ```
 
 
-Once gain we look at the loss as a function of the parameter, and this time it looks a lot better. The loss is not convex, but the gradient points in the right direction over a much larger interval. Here, we arbitrarily set the observer gain to $K=1$, we will later let the optimizer learn this parameter.
+Once again, we look at the loss as a function of the parameter, and this time it looks a lot better. The loss is not convex, but the gradient points in the right direction over a much larger interval. Here, we arbitrarily set the observer gain to $K=1$; we will later let the optimizer learn this parameter.
 
 For completeness, we also perform estimation using both losses. We choose an initial guess we know will be hard for the simulation-error minimization just to drive home the point:
 
@@ -155,7 +155,7 @@ respred.u
 ```
 
 Now, we might ask ourselves why we used a correct on the form $Ke$ and didn't instead set the angle in the simulation *equal* to the measurement. The reason is twofold
-1. If our prediction of the angle is 100% based on the measurements, the model parameters do not matter for the prediction and we can thus not hope to learn their values.
+1. If our prediction of the angle is 100% based on the measurements, the model parameters do not matter for the prediction, and we thus cannot hope to learn their values.
 2. The measurement is usually noisy, and we thus want to *fuse* the predictive power of the model with the information of the measurements. The Kalman filter is an optimal approach to this information fusion under special circumstances (linear model, Gaussian noise).
 
 We thus let the optimization *learn* the best value of the observer gain in order to make the best predictions. 
diff --git a/docs/src/examples/ode/second_order_adjoints.md b/docs/src/examples/ode/second_order_adjoints.md
index 3c6378841..5f59afb4e 100644
--- a/docs/src/examples/ode/second_order_adjoints.md
+++ b/docs/src/examples/ode/second_order_adjoints.md
@@ -6,7 +6,7 @@ second order sensitivity analysis for fast Hessians and Hessian-vector
 products (via forward-over-reverse), we can utilize these in our neural/universal
 differential equation training processes.
 
-`sciml_train` is setup to automatically use second order sensitivity analysis
+`sciml_train` is set up to automatically use second order sensitivity analysis
 methods if a second order optimizer is requested via Optim.jl. Thus `Newton`
 and `NewtonTrustRegion` optimizers will use a second order Hessian-based
 optimization, while `KrylovTrustRegion` will utilize a Krylov-based method
diff --git a/docs/src/examples/ode/second_order_neural.md b/docs/src/examples/ode/second_order_neural.md
index 984c8a187..330d70a42 100644
--- a/docs/src/examples/ode/second_order_neural.md
+++ b/docs/src/examples/ode/second_order_neural.md
@@ -6,7 +6,7 @@ The neural ODE focuses and finding a neural network such that:
 u^\prime = NN(u)
 ```
 
-However, in many cases in physics-based modeling, the key object is not the
+However, often in physics-based modeling, the key object is not the
 velocity but the acceleration: knowing the acceleration tells you the force
 field and thus the generating process for the dynamical system. Thus what we want
 to do is find the force, i.e.:
diff --git a/docs/src/examples/pde/pde_constrained.md b/docs/src/examples/pde/pde_constrained.md
index b9578a1a3..bad9dd86c 100644
--- a/docs/src/examples/pde/pde_constrained.md
+++ b/docs/src/examples/pde/pde_constrained.md
@@ -107,7 +107,7 @@ using DifferentialEquations, Optimization, OptimizationPolyalgorithms,
 
 ### Parameters
 
-First, we setup the 1-dimensional space over which our equations will be evaluated.
+First, we set up the 1-dimensional space over which our equations will be evaluated.
 `x` spans **from 0.0 to 10.0** in steps of **0.01**; `t` spans **from 0.00 to 0.04** in
 steps of **4.0e-5**.
 
@@ -166,7 +166,7 @@ end
 
 ### Heat Differential Equation
 
-Next, we setup our desired set of equations in order to define our problem.
+Next, we set up our desired set of equations in order to define our problem.
 
 ```@example pde2
 ## ODE description of the Physics:
@@ -180,7 +180,7 @@ end
 
 ### Solve and Plot Ground Truth
 
-We then solve and plot our partial differential equation. This is the true solution which we
+We then solve and plot our partial differential equation. This is the true solution, which we
 will compare to further on.
 
 ```@example pde2
@@ -208,14 +208,14 @@ end
 
 ### Train Parameters
 
-Training our model requires a **loss function**, an **optimizer** and a **callback
+Training our model requires a **loss function**, an **optimizer**, and a **callback
 function** to display the progress.
 
 #### Loss
 
 We first make our predictions based on the current values of our parameters `ps`, then
 take the difference between the predicted solution and the truth above. For the loss, we
-use the **Mean squared error**.
+use the **mean squared error**.
 
 ```@example pde2
 ## Defining Loss function
@@ -272,7 +272,7 @@ plot!(PRED[end][:,end], lw=2, label="Prediction")
 
 The parameters are trained using `Optimization.solve` and adjoint sensitivities.
 The resulting best parameters are stored in `res` and `res.u` returns the
-parameters that minimizes the cost function.
+parameters that minimize the cost function.
 
 ```@example pde2
 adtype = Optimization.AutoZygote()
diff --git a/docs/src/examples/sde/SDE_control.md b/docs/src/examples/sde/SDE_control.md
index 3423318ce..20f845f27 100644
--- a/docs/src/examples/sde/SDE_control.md
+++ b/docs/src/examples/sde/SDE_control.md
@@ -1,7 +1,7 @@
 # Controlling Stochastic Differential Equations
 
 In this tutorial, we show how to use DiffEqFlux to control the time evolution of a system
-described by a stochastic differential equations (SDE). Specifically, we consider a
+described by a stochastic differential equation (SDE). Specifically, we consider a
 continuously monitored qubit described by an SDE in the Ito sense with multiplicative
 scalar noise (see [1] for a reference):
 
@@ -312,7 +312,7 @@ using Plots
 ```
 
 ### Parameters
-We define the parameters of the qubit and hyper-parameters of the training process.
+We define the parameters of the qubit and hyperparameters of the training process.
 ```@example sdecontrol
 lr = 0.01f0
 epochs = 100
@@ -371,7 +371,7 @@ In plain terms, the quantities that were defined are:
 - `Δ` = detuning between the qubit and the laser
 - `Ωmax` = maximum frequency of the control laser
 - `κ` = decay rate
-- `C1` = loss function hyper-parameter
+- `C1` = loss function hyperparameter
 
 ### Controller
 We use a neural network to control the parameter Ω(t). Alternatively, one could
diff --git a/docs/src/examples/sde/optimization_sde.md b/docs/src/examples/sde/optimization_sde.md
index 086bb6923..f6f306eed 100644
--- a/docs/src/examples/sde/optimization_sde.md
+++ b/docs/src/examples/sde/optimization_sde.md
@@ -5,7 +5,7 @@ SciMLSensitivity.jl) for forward-mode automatic differentiation of a small
 stochastic differential equation. For large parameter equations, like neural
 stochastic differential equations, you should use reverse-mode automatic
 differentiation. However, forward-mode can be more efficient for low numbers
-of parameters (<100). (Note: the default is reverse-mode AD which is more suitable
+of parameters (<100). (Note: the default is reverse-mode AD, which is more suitable
 for things like neural SDEs!)
 
 ## Example 1: Fitting Data with SDEs via Method of Moments and Parallelism
@@ -108,10 +108,10 @@ approximately 4 minutes.
 
 ## Example 2: Fitting SDEs via Bayesian Quasi-Likelihood Approaches
 
-An inference method which can be much more efficient in many cases is the quasi-likelihood approach.
+An inference method which can often be much more efficient is the quasi-likelihood approach.
 This approach matches the random likelihood of the SDE output with the random sampling of a Bayesian
 inference problem to more efficiently directly estimate the posterior distribution. For more information,
-please see [the Turing.jl Bayesian Differential Equations tutorial](https://github.com/TuringLang/TuringTutorials/blob/master/10_diffeq.ipynb)
+please see [the Turing.jl Bayesian Differential Equations tutorial](https://github.com/TuringLang/TuringTutorials/blob/master/10_diffeq.ipynb).
 
 ## Example 3: Controlling SDEs to an objective
 
diff --git a/docs/src/getting_started.md b/docs/src/getting_started.md
index 8fec8174d..a08555a77 100644
--- a/docs/src/getting_started.md
+++ b/docs/src/getting_started.md
@@ -2,24 +2,25 @@
 
 !!! warn
 
-      This tutorial assumes familiarity with DifferentialEquations.jl
+      This tutorial assumes familiarity with DifferentialEquations.jl.
       If you are not familiar with DifferentialEquations.jl, please consult
-      [the DifferentialEquations.jl documentation](https://docs.sciml.ai/DiffEqDocs/stable/)
+      [the DifferentialEquations.jl documentation](https://docs.sciml.ai/DiffEqDocs/stable/).
 
 SciMLSensitivity.jl is a tool for obtaining derivatives of equation solvers,
 such as differential equation solvers. These can be used in many ways, such as
 for analyzing the local sensitivities of a system or to compute the gradients
-of cost functions for model calibratrion and parameter estimation. In this
-tutorial we will show how to make use of the tooling in SciMLSensitivity.jl
+of cost functions for model calibration and parameter estimation. In this
+tutorial, we will show how to make use of the tooling in SciMLSensitivity.jl
 to differentiate the ODE solvers.
 
 !!! note
-  SciMLSensitivity.jl applies to all equation solvers of the SciML ecosystem,
-  such as for linear solvers, nonlinear solvers, nonlinear optimization,
-  and more. This tutorial focuses on differential equations, so please see
-  the other tutorials focused on these other SciMLProblem types as necessary.
-  While the interface works similarly for all problem types, these tutorials
-  will showcase the aspects that are special to a given problem.
+
+    SciMLSensitivity.jl applies to all equation solvers of the SciML ecosystem,
+    such as linear solvers, nonlinear solvers, nonlinear optimization,
+    and more. This tutorial focuses on differential equations, so please see
+    the other tutorials focused on these other SciMLProblem types as necessary.
+    While the interface works similarly for all problem types, these tutorials
+    will showcase the aspects that are special to a given problem.
 
 ## Setup
 
@@ -45,10 +46,10 @@ differentiation methods.
 
 Let's say we need the derivative of the solution with respect to the initial condition
 `u0` and its parameters `p`. One of the simplest ways to do this is via ForwardDiff.jl.
-To do this, all that one needs to do is use 
+All one needs to do is use
 [the ForwardDiff.jl library](https://juliadiff.org/ForwardDiff.jl/stable/) to differentiate
 some function `f` which uses a differential equation `solve` inside of it. For example,
-let's say we want the derivative of the first component of ODE solution with respect to 
+let's say we want the derivative of the first component of the ODE solution with respect to 
 these quantities at evenly spaced time points of `dt = 1`. We can compute this via:
 
 ```@example diffode
@@ -80,7 +81,7 @@ solution at time `t=1` with respect to `p[1]`.
 ## Reverse-Mode Automatic Differentiation
 
 [The `solve` function is automatically compatible with AD systems like Zygote.jl](https://docs.sciml.ai/SciMLSensitivity/stable/)
-and thus there is no machinery that is necessary to use other than to put `solve` inside of
+and thus there is no machinery that is necessary to use other than to put `solve` inside
 a function that is differentiated by Zygote. For example, the following computes the solution 
 to an ODE and computes the gradient of a loss function (the sum of the ODE's output at each 
 timepoint with dt=0.1) via the adjoint method:
@@ -103,7 +104,7 @@ chosen.
 ### Choosing Sensitivity Algorithms
 
 The algorithms for differentiation calculation are called `AbstractSensitivityAlgorithms`,
-or `sensealg`s for short. These are choosen by passing the `sensealg` keyword argument into solve.
+or `sensealg`s for short. These are chosen by passing the `sensealg` keyword argument into `solve`.
 Let's demonstrate this by choosing the `QuadratureAdjoint` `sensealg` for the differentiation of
 this system:
 
@@ -116,9 +117,9 @@ du01,dp1 = Zygote.gradient(sum_of_solution,u0,p)
 ```
 
 Here this computes the derivative of the output with respect to the initial
-condition and the the derivative with respect to the parameters respectively
+condition and the derivative with respect to the parameters, respectively,
 using the `QuadratureAdjoint()`. For more information on the choices of sensitivity
-algorithms, see the [reference documentation in choosing sensitivity algorithms](@ref sensitivity_diffeq)
+algorithms, see the [reference documentation in choosing sensitivity algorithms](@ref sensitivity_diffeq).
 
 !!! note
     ForwardDiff.jl's automatic differentiation system ignores the sensitivity algorithms.
@@ -126,14 +127,14 @@ algorithms, see the [reference documentation in choosing sensitivity algorithms]
 ## When Should You Use Forward or Reverse Mode?
 
 Good question! The simple answer is, if you are differentiating a system of
-100 equations or less, use forward-mode, otherwise reverse-mode. But it can
+100 equations or fewer, use forward-mode, otherwise reverse-mode. But it can
 be a lot more complicated than that! For more information, see the 
-[reference documentation in choosing sensitivity algorithms](@ref sensitivity_diffeq)
+[reference documentation in choosing sensitivity algorithms](@ref sensitivity_diffeq).
 
 ## And that is it! Where should you go from here?
 
 That's all there is to the basics of differentiating the ODE solvers with SciMLSensitivity.jl.
-That said, check out the follwing tutorials to dig into more detail:
+That said, check out the following tutorials to dig into more detail:
 
 * See the [ODE parameter estimation tutorial](@ref odeparamestim) to learn how to fit the parameters of ODE systems
 * See the [direct sensitivity tutorial](@ref direct_sensitivity) to dig into the lower level API for more performance
\ No newline at end of file
diff --git a/docs/src/index.md b/docs/src/index.md
index fd7f766c1..49babf6c5 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -3,9 +3,9 @@
 SciMLSensitivity.jl is the automatic differentiation and adjoints system for the SciML
 ecosystem. Also known as local sensitivity analysis, these methods allow for calculation
 of fast derivatives of SciML problem types which are commonly used to analyze model
-sensitivities, callibrate models to data, train neural ODEs, perform automated model
+sensitivities, calibrate models to data, train neural ODEs, perform automated model
 discovery via universal differential equations, and more. SciMLSensitivity.jl is
-a high level interface that pulls together all of the tools with heuristics
+a high-level interface that pulls together all the tools with heuristics
 and helper functions to make solving inverse problems and inferring models
 as easy as possible without losing efficiency.
 
@@ -36,7 +36,7 @@ using Pkg
 Pkg.add("SciMLSensitivity")
 ```
 
-## High Level Interface: `sensealg`
+## High-Level Interface: `sensealg`
 
 The highest level interface is provided by the function `solve`:
 
@@ -69,7 +69,7 @@ is used, i.e. going back to the AD mechanism.
     
 ## Equation Scope
 
-SciMLSensitivity.jl supports all of the equation types of the 
+SciMLSensitivity.jl supports all the equation types of the 
 [SciML Common Interface](https://docs.sciml.ai/SciMLBase/stable/), extending the problem
 types by adding overloads for automatic differentiation to improve the performance
 and flexibility of the differentiation system. This includes:
@@ -110,7 +110,7 @@ SciMLSensitivity is for universal differential equations, where these can includ
 delays, physical constraints, stochasticity, events, and all other kinds of
 interesting behavior that shows up in scientific simulations. Neural networks can
 be all or part of the model. They can be around the differential equation,
-in the cost function, or inside of the differential equation. Neural networks
+in the cost function, or inside the differential equation. Neural networks
 representing unknown portions of the model or functions can go anywhere you
 have uncertainty in the form of the scientific simulator. Forward sensitivity
 and adjoint equations are automatically generated with checkpointing and
@@ -138,13 +138,13 @@ post](https://julialang.org/blog/2019/01/fluxdiffeq) (which we try to keep
 updated for changes to the libraries). Additional demonstrations, like neural
 PDEs and neural jump SDEs, can be found [at this blog
 post](http://www.stochasticlifestyle.com/neural-jump-sdes-jump-diffusions-and-neural-pdes/)
-(among many others!). All of these features are only part of the advantage, as this library
+(among many others!). All these features are only part of the advantage, as this library
 [routinely benchmarks orders of magnitude faster than competing libraries like torchdiffeq](@ref Benchmarks).
 Use with GPUs is highly optimized by
 [recompiling the solvers to GPUs to remove all CPU-GPU data transfers](https://www.stochasticlifestyle.com/solving-systems-stochastic-pdes-using-gpus-julia/),
 while use with CPUs uses specialized kernels for accelerating differential equation solves.
 
-Many different training techniques are supported by this package, including:
+Many training techniques are supported by this package, including:
 
 - Optimize-then-discretize (backsolve adjoints, checkpointed adjoints, quadrature adjoints)
 - Discretize-then-optimize (forward and reverse mode discrete sensitivity analysis)
@@ -158,7 +158,7 @@ Many different training techniques are supported by this package, including:
   equations etc. is provided by integration with [Turing.jl](https://turing.ml/stable/docs/using-turing/)
   and [Gen.jl](https://github.com/probcomp/Gen.jl). Reproduce
   [variational loss functions](https://arxiv.org/abs/2001.01328) by plugging
-  [composible libraries together](https://turing.ml/stable/tutorials/09-variational-inference/).
+  [composable libraries together](https://turing.ml/stable/tutorials/09-variational-inference/).
 
 all while mixing forward mode and reverse mode approaches as appropriate for the
 most speed. For more details on the adjoint sensitivity analysis methods for
@@ -175,8 +175,8 @@ With this package, you can explore various ways to integrate the two methodologi
 ## Note on Modularity and Composability with Solvers
 
 Note that SciMLSensitivity.jl purely built on composable and modular infrastructure. 
-SciMLSensitivity provides high level helper functions and documentation for the user, but the
-code generation stack is modular and composes in many different ways. For example, one can
+SciMLSensitivity provides high-level helper functions and documentation for the user, but the
+code generation stack is modular and composes in many ways. For example, one can
 use and swap out the ODE solver between any common interface compatible library, like:
 
 - Sundials.jl
@@ -184,7 +184,7 @@ use and swap out the ODE solver between any common interface compatible library,
 - LSODA.jl
 - [IRKGaussLegendre.jl](https://github.com/mikelehu/IRKGaussLegendre.jl)
 - [SciPyDiffEq.jl](https://github.com/SciML/SciPyDiffEq.jl)
-- [... etc. many other choices!](https://docs.sciml.ai/DiffEqDocs/stable/solvers/ode_solve/)
+- [… etc. many other choices!](https://docs.sciml.ai/DiffEqDocs/stable/solvers/ode_solve/)
 
 In addition, due to the composability of the system, none of the components are directly
 tied to the Flux.jl machine learning framework. For example, you can [use SciMLSensitivity.jl
diff --git a/docs/src/sensitivity_math.md b/docs/src/sensitivity_math.md
index cd0db855b..a1cdd0f43 100644
--- a/docs/src/sensitivity_math.md
+++ b/docs/src/sensitivity_math.md
@@ -65,7 +65,7 @@ by using a dual number with a single partial dimension, ``d = x + v \epsilon`` w
 f(d) = f(x) + Jv \epsilon
 ```
 
-as a fast way to calcuate ``Jv``. Thus, except when a sufficiently good function for `J` is given
+as a fast way to calculate ``Jv``. Thus, except when a sufficiently good function for `J` is given
 by the user, the Jacobian is never formed. For more details, consult the
 [MIT 18.337 lecture notes on forward mode AD](https://mitmath.github.io/18337/lecture8/automatic_differentiation.html).
 
@@ -100,7 +100,7 @@ requires the continuous forward solution in order to solve the adjoint solution,
 and the adjoint solution is required to be continuous in order to calculate the
 resulting integral.
 
-There is one extra detail to consider. In many cases we would like to calculate
+There is one extra detail to consider. In many cases, we would like to calculate
 the adjoint sensitivity of some discontinuous functional of the solution. One
 canonical function is the L2 loss against some data points, that is:
 
@@ -132,6 +132,6 @@ We note that
 ```
 
 is a vector-transpose Jacobian product, also known as a `vjp`, which can be efficiently computed
-using the pullback of backpropogation on the user function `f` with a forward pass at `u` with a
+using the pullback of backpropagation on the user function `f` with a forward pass at `u` with a
 pullback vector ``\lambda^{\star}``. For more information, consult the
-[MIT 18.337 lecture notes on reverse mode AD](https://mitmath.github.io/18337/lecture10/estimation_identification)
+[MIT 18.337 lecture notes on reverse mode AD](https://mitmath.github.io/18337/lecture10/estimation_identification).
diff --git a/docs/src/tutorials/adjoint_continuous_functional.md b/docs/src/tutorials/adjoint_continuous_functional.md
index f4ff224d9..4c8808199 100644
--- a/docs/src/tutorials/adjoint_continuous_functional.md
+++ b/docs/src/tutorials/adjoint_continuous_functional.md
@@ -19,7 +19,7 @@ vector.
 However, there is an expanded set of cost functionals supported by SciMLSensitivity.jl,
 continuous cost functionals, which are not possible through automatic differentiation
 interfaces. In an abstract sense, a continuous cost functional is a total cost ``G``
-defined as the integral of the instantanious cost ``g`` at all time points. In other words,
+defined as the integral of the instantaneous cost ``g`` at all time points. In other words,
 the total cost is defined as:
 
 ```math
@@ -33,7 +33,7 @@ functionals can be easily evaluated using the direct sensitivity analysis interf
 ## Example: Continuous Functionals with Forward Sensitivity Analysis via Interpolation
 
 Evaluating continuous cost functionals with forward sensitivity analysis is rather
-straightforward since one can simply use the fact that the solution from
+straightforward, since one can simply use the fact that the solution from
 `ODEForwardSensitivityProblem` is continuous when `dense=true`. For example,
 
 ```@example continuousadjoint
@@ -52,12 +52,12 @@ sol = solve(prob,DP8())
 gives a continuous solution `sol(t)` with the derivative at each time point. This
 can then be used to define a continuous cost function via
 [Integrals.jl](https://docs.sciml.ai/Integrals/stable/), though the derivative would
-need to be defined by hand using the extra sensitivity terms.
+need to be manually defined using the extra sensitivity terms.
 
 ## Example: Continuous Adjoints on an Energy Functional
 
 Continuous adjoints on a continuous functional are more automatic than forward mode.
-In this case we'd like to calculate the adjoint sensitivity of the scalar energy
+In this case, we'd like to calculate the adjoint sensitivity of the scalar energy
 functional:
 
 ```math
diff --git a/docs/src/tutorials/chaotic_ode.md b/docs/src/tutorials/chaotic_ode.md
index d2e05c064..38cacb42d 100644
--- a/docs/src/tutorials/chaotic_ode.md
+++ b/docs/src/tutorials/chaotic_ode.md
@@ -15,7 +15,7 @@ where
 ```
 under the assumption of ergodicity, ``\langle g \rangle_∞`` only depends on `p`.
 
-In the case of chaotic systems, the trajectories diverge with ``O(1)`` error]. This
+In the case of chaotic systems, the trajectories diverge with ``O(1)`` error. This
 can be seen, for instance, when solving the [Lorenz system](https://en.wikipedia.org/wiki/Lorenz_system) at
 `1e-14` tolerances with 9th order integrators and a small machine-epsilon perturbation:
 
@@ -39,7 +39,7 @@ sol2 = solve(prob, Vern9(), abstol = 1e-14 + eps(Float64), reltol = 1e-14)
 
 More formally, such chaotic behavior can be analyzed using tools from
 [uncertainty quantification](@ref uncertainty_quantification).
-This effect of diverging trajectories is known as the butterfly effect and can be
+This effect of diverging trajectories is known as the butterfly effect, and can be
 formulated as "most (small) perturbations on initial conditions or parameters lead
 to new trajectories diverging exponentially fast from the original trajectory".
 
@@ -48,7 +48,7 @@ as follows: "For most initial conditions, the (homogeneous) tangent solutions gr
 exponentially fast."
 
 To compute derivatives of an objective ``\langle g \rangle_∞`` with respect to the
-parameters `p` of a chaotic systems, one thus encounters that "traditional" forward
+parameters `p` of a chaotic system, one thus encounters that "traditional" forward
 and adjoint sensitivity methods diverge because the tangent space diverges with a
 rate given by the Lyapunov exponent. Taking the average of these derivative can then
 also fail, i.e., one finds that the average derivative is not the derivative of
@@ -56,7 +56,7 @@ the average.
 
 Although numerically computed chaotic trajectories diverge from the true/original
 trajectory, the [shadowing theorem](http://mathworld.wolfram.com/ShadowingTheorem.html) guarantees that there exists an errorless trajectory
-with a slightly different initial condition that stays near ("shadows") the numerically
+with a slightly different initial condition that stays near ("shadows") the numerically
 computed one, see, e.g, the [blog post](https://frankschae.github.io/post/shadowing/) or the [non-intrusive least squares shadowing paper](https://arxiv.org/abs/1611.00880) for more details.
 Essentially, the idea is to replace the ill-conditioned ODE by a well-conditioned
 optimization problem. Shadowing methods use the shadowing theorem within a renormalization
diff --git a/docs/src/tutorials/data_parallel.md b/docs/src/tutorials/data_parallel.md
index 7f5b5d7be..c442680e4 100644
--- a/docs/src/tutorials/data_parallel.md
+++ b/docs/src/tutorials/data_parallel.md
@@ -11,7 +11,7 @@ different modes of parallelism. These examples are not exhaustive.
 ## Within-ODE Multithreaded and GPU Batching
 
 We end by noting that there is an alternative way of batching which
-can be more efficient in some cases like neural ODEs. With a neural
+can be more efficient in some cases, like neural ODEs. With neural
 networks, columns are treated independently (by the properties of
 matrix multiplication). Thus for example, with `Chain` we can
 define an ODE:
@@ -55,8 +55,8 @@ prob = ODEProblem(f,Lux.gpu(u0),(0f0,1f0),Lux.gpu(p))
 solve(prob,Tsit5())
 ```
 
-This method of parallelism is optimal if all of the operations are
-linear algebra operations such as a neural ODE. Thus this method of
+This method of parallelism is optimal if all the operations are
+linear algebra operations, such as a neural ODE. Thus this method of
 parallelism is demonstrated in the [MNIST tutorial](@ref mnist).
 
 However, this method of parallelism has many limitations. First of all,
@@ -130,7 +130,7 @@ In order to make use of the ensemble interface, we need to build an
 the different `DEProblem`s to solve. This is the place where we can
 randomly sample initial conditions or pull initial conditions from
 an array of batches in order to perform our study. To do this, we
-first define a prototype `DEProblem`. Here we use the following
+first define a prototype `DEProblem`. Here, we use the following
 `ODEProblem` as our base:
 
 ```@example dataparallel
@@ -154,7 +154,7 @@ We now build the `EnsembleProblem` with this basis:
 ensemble_prob = EnsembleProblem(prob, prob_func = prob_func)
 ```
 
-Now to solve an ensemble problem, we need to choose an ensembling
+Now, to solve an ensemble problem, we need to choose an ensembling
 algorithm and choose the number of trajectories to solve. Here let's
 solve this in serial with 100 trajectories. Note that `i` will thus run
 from `1:100`.
@@ -186,11 +186,11 @@ all the same, except you utilize `EnsembleDistributed` as the ensembler:
 sim = solve(ensemble_prob, Tsit5(), EnsembleDistributed(), saveat = 0.1, trajectories = 100)
 ```
 
-Note that for this to work you need to ensure that your processes are
+Note that for this to work, you need to ensure that your processes are
 already started. For more information on setting up processes and utilizing
 a compute cluster, see [the official distributed documentation](https://docs.julialang.org/en/v1/manual/distributed-computing/). The key feature to recognize is that, due to
-the message passing required for cluster compute, one needs to ensure
-that all of the required functions are defined on the worker processes.
+the message passing required for cluster compute, one must ensure
+that all the required functions are defined on the worker processes.
 The following is a full example of a distributed batching setup:
 
 ```julia
@@ -243,7 +243,7 @@ to a cluster, check out [ClusterManagers.jl](https://github.com/JuliaParallel/Cl
 
 DiffEqGPU.jl allows for generating code parallelizes an ensemble on
 generated CUDA kernels. This method is efficient for sufficiently
-small (<100 ODE) problems where the significant computational cost
+small (<100 ODE) problems, where the significant computational cost
 is due to the large number of batch trajectories that need to be
 solved. This kernel-building process adds a few restrictions to the
 function, such as requiring it has no boundschecking or allocations.
diff --git a/docs/src/tutorials/direct_sensitivity.md b/docs/src/tutorials/direct_sensitivity.md
index a1a7ea18b..96f99aa54 100644
--- a/docs/src/tutorials/direct_sensitivity.md
+++ b/docs/src/tutorials/direct_sensitivity.md
@@ -10,7 +10,7 @@ demonstrates some of those functions.
 Forward sensitivity analysis is performed by defining and solving an augmented
 ODE. To define this augmented ODE, use the `ODEForwardSensitivityProblem` type
 instead of an ODE type. For example, we generate an ODE with the sensitivity
-equations attached for the Lotka-Volterra equations by:
+equations attached to the Lotka-Volterra equations by:
 
 ```@example directsense
 using OrdinaryDiffEq, SciMLSensitivity
@@ -67,8 +67,8 @@ solution, see the [direct forward sensitivity analysis manual page](@ref forward
 
 ## Example using `adjoint_sensitivities` for discrete adjoints
 
-In this example we will show solving for the adjoint sensitivities of a discrete
-cost functional. First let's solve the ODE and get a high quality continuous
+In this example, we will show solving for the adjoint sensitivities of a discrete
+cost functional. First, let's solve the ODE and get a high-quality continuous
 solution:
 
 ```@example directsense
diff --git a/docs/src/tutorials/parameter_estimation_ode.md b/docs/src/tutorials/parameter_estimation_ode.md
index 7e36444fa..d20d7dfd4 100644
--- a/docs/src/tutorials/parameter_estimation_ode.md
+++ b/docs/src/tutorials/parameter_estimation_ode.md
@@ -60,7 +60,7 @@ result_ode = Optimization.solve(optprob, PolyOpt(),
 
 ## Explanation
 
-First let's create a Lotka-Volterra ODE using DifferentialEquations.jl. For
+First, let's create a Lotka-Volterra ODE using DifferentialEquations.jl. For
 more details, [see the DifferentialEquations.jl documentation](https://docs.sciml.ai/DiffEqDocs/stable/). The Lotka-Volterra equations have the form:
 
 ```math
@@ -104,10 +104,10 @@ savefig("LV_ode.png")
 ![LV Solution Plot](https://user-images.githubusercontent.com/1814174/51388169-9a07f300-1af6-11e9-8c6c-83c41e81d11c.png)
 
 For this first example, we do not yet include a neural network. We take
-[AD-compatible `solve`
-function](https://docs.sciml.ai/SciMLSensitivity/stable/manual/differential_equation_sensitivities/) function
+[an AD-compatible `solve`
+function](https://docs.sciml.ai/SciMLSensitivity/stable/manual/differential_equation_sensitivities/)
 that takes the parameters and an initial condition and returns the solution of
-the differential equation. Next we choose a loss function. Our goal will be to
+the differential equation. Next, we choose a loss function. Our goal will be to
 find parameters that make the Lotka-Volterra solution constant `x(t)=1`, so we
 define our loss as the squared distance from 1.
 
diff --git a/docs/src/tutorials/training_tips/divergence.md b/docs/src/tutorials/training_tips/divergence.md
index f57180ea1..953b89fa3 100644
--- a/docs/src/tutorials/training_tips/divergence.md
+++ b/docs/src/tutorials/training_tips/divergence.md
@@ -74,8 +74,8 @@ res = Optimization.solve(optprob,ADAM(), maxiters = 1000)
 # res = Optimization.solve(optprob,NLopt.LD_LBFGS(), maxiters = 1000) ### errors!
 ```
 
-You might notice that `AutoZygote` (default) fails for the above `Optimization.solve` call with Optim's optimizers which happens because
-of Zygote's behaviour for zero gradients in which case it returns `nothing`. To avoid such issue you can just use a different version of the same check which compares the size of the obtained 
+You might notice that `AutoZygote` (default) fails for the above `Optimization.solve` call with Optim's optimizers, which happens because
+of Zygote's behavior for zero gradients, in which case it returns `nothing`. To avoid such issues, you can just use a different version of the same check which compares the size of the obtained 
 solution and the data we have, shown below, which is easier to AD.
 
 ```julia
diff --git a/docs/src/tutorials/training_tips/local_minima.md b/docs/src/tutorials/training_tips/local_minima.md
index 80f2c21a8..52d2eb215 100644
--- a/docs/src/tutorials/training_tips/local_minima.md
+++ b/docs/src/tutorials/training_tips/local_minima.md
@@ -10,9 +10,9 @@ there are many strategies to avoid local minima:
 
 ## Iterative Growing Of Fits to Reduce Probability of Bad Local Minima
 
-In this example we will show how to use strategy (4) in order to increase the
+In this example, we will show how to use strategy (4) in order to increase the
 robustness of the fit. Let's start with the same neural ODE example we've used
-before except with one small twist: we wish to find the neural ODE that fits
+before, except with one small twist: we wish to find the neural ODE that fits
 on `(0,5.0)`. Naively, we use the same training strategy as before:
 
 ```@example iterativefit
@@ -85,12 +85,12 @@ plt = scatter(tsteps[1:size(pred,2)], ode_data[1,1:size(pred,2)], label = "data"
 scatter!(plt, tsteps[1:size(pred,2)], pred[1,:], label = "prediction")
 ```
 
-However, we've now fallen into a trap of a local minima. If the optimizer changes
-the parameters so it dips early, it will increase the loss because there will
+However, we've now fallen into a trap of a local minimum. If the optimizer changes
+the parameters so that it dips early, it will increase the loss because there will
 be more error in the later parts of the time series. Thus it tends to just stay
 flat and never fit perfectly. This thus suggests strategies (2) and (3): do not
 allow the later parts of the time series to influence the fit until the later
-stages. Strategy (3) seems to be more robust, so this is what will be demonstrated.
+stages. Strategy (3) seems more robust, so this is what will be demonstrated.
 
 Let's start by reducing the timespan to `(0,1.5)`:
 
@@ -134,7 +134,7 @@ plt = scatter(tsteps[1:size(pred,2)], ode_data[1,1:size(pred,2)], label = "data"
 scatter!(plt, tsteps[1:size(pred,2)], pred[1,:], label = "prediction")
 ```
 
-Once again a great fit. Now we utilize these parameters as the initial condition
+Once again, a great fit. Now we utilize these parameters as the initial condition
 to the full fit:
 
 ```@example iterativefit
@@ -156,7 +156,7 @@ scatter!(plt, tsteps[1:size(pred,2)], pred[1,:], label = "prediction")
 
 ## Training both the initial conditions and the parameters to start
 
-In this example we will show how to use strategy (4) in order to accomplish the
+In this example, we will show how to use strategy (4) in order to accomplish the
 same goal, except rather than growing the trajectory iteratively, we can train on
 the whole trajectory. We do this by allowing the neural ODE to learn both the
 initial conditions and parameters to start, and then reset the initial conditions
diff --git a/docs/src/tutorials/training_tips/multiple_nn.md b/docs/src/tutorials/training_tips/multiple_nn.md
index 17e72a938..96a72068f 100644
--- a/docs/src/tutorials/training_tips/multiple_nn.md
+++ b/docs/src/tutorials/training_tips/multiple_nn.md
@@ -86,7 +86,7 @@ res2_uode = Optimization.solve(optprob2, NLopt.LD_LBFGS(), maxiters = 10000, cal
 ```
 
 The key is that `Optimization.solve` acts on a single parameter vector `p`.
-Thus what we do here is concatenate all of the parameters into a single
+Thus what we do here is concatenate all the parameters into a single
 ComponentVector `p` and then train on this parameter
 vector. Whenever we need to evaluate the neural networks, we dereference the
 vector and grab the key that corresponds to the neural network.