diff --git a/lectures/_static/lecture_specific/orth_proj/orth_proj_thm2.pdf b/lectures/_static/lecture_specific/orth_proj/orth_proj_thm2.pdf new file mode 100644 index 00000000..d3dbcbeb Binary files /dev/null and b/lectures/_static/lecture_specific/orth_proj/orth_proj_thm2.pdf differ diff --git a/lectures/_static/lecture_specific/orth_proj/orth_proj_thm2.png b/lectures/_static/lecture_specific/orth_proj/orth_proj_thm2.png index 14beec84..dbf03e5e 100644 Binary files a/lectures/_static/lecture_specific/orth_proj/orth_proj_thm2.png and b/lectures/_static/lecture_specific/orth_proj/orth_proj_thm2.png differ diff --git a/lectures/_static/lecture_specific/orth_proj/orth_proj_thm2.tex b/lectures/_static/lecture_specific/orth_proj/orth_proj_thm2.tex index 993b693b..18c999e5 100644 --- a/lectures/_static/lecture_specific/orth_proj/orth_proj_thm2.tex +++ b/lectures/_static/lecture_specific/orth_proj/orth_proj_thm2.tex @@ -19,12 +19,13 @@ \draw[important line, thick] (Z1) -- (O) node[right] {}; \draw[important line, thick] (Py) -- (Z2) node[right] {$S$}; \draw[important line, blue,->] (O) -- (y) node[right] {$y$}; + \draw[important line, blue,->] (O) -- (Py') node[anchor = north west, text width=5em] {$P y'$}; \draw[dotted line] (0.54,0.27) -- (0.51,0.33); \draw[dotted line] (0.57,0.36) -- (0.51,0.33); \draw[dotted line] (-0.22,-0.11) -- (-0.25,-0.05); \draw[dotted line] (-0.31,-0.08) -- (-0.25,-0.05); \draw[dashed line, black] (y) -- (Py); - \draw[dashed line, black] (y') -- (Py') node[anchor = north west, text width=5em] {$P y'$}; + \draw[dashed line, black] (y') -- (Py'); \end{tikzpicture} \end{document} \ No newline at end of file diff --git a/lectures/orth_proj.md b/lectures/orth_proj.md index 1f78640f..f9a88298 100644 --- a/lectures/orth_proj.md +++ b/lectures/orth_proj.md @@ -48,7 +48,7 @@ import numpy as np from scipy.linalg import qr ``` -### Further Reading +### Further reading For background and foundational concepts, see our lecture [on linear algebra](https://python-intro.quantecon.org/linear_algebra.html). @@ -58,9 +58,9 @@ For a complete set of proofs in a general setting, see, for example, {cite}`Roma For an advanced treatment of projection in the context of least squares prediction, see [this book chapter](http://www.tomsargent.com/books/TOMchpt.2.pdf). -## Key Definitions +## Key definitions -Assume $x, z \in \mathbb R^n$. +Assume $x, z \in \mathbb R^n$. Define $\langle x, z\rangle = \sum_i x_i z_i$. @@ -86,7 +86,7 @@ The **orthogonal complement** of linear subspace $S \subset \mathbb R^n$ is the ``` -$S^\perp$ is a linear subspace of $\mathbb R^n$ +$S^\perp$ is a linear subspace of $\mathbb R^n$ * To see this, fix $x, y \in S^{\perp}$ and $\alpha, \beta \in \mathbb R$. * Observe that if $z \in S$, then @@ -117,7 +117,7 @@ $$ = \| x_1 \|^2 + \| x_2 \|^2 $$ -### Linear Independence vs Orthogonality +### Linear independence vs orthogonality If $X \subset \mathbb R^n$ is an orthogonal set and $0 \notin X$, then $X$ is linearly independent. @@ -125,13 +125,16 @@ Proving this is a nice exercise. While the converse is not true, a kind of partial converse holds, as we'll {ref}`see below `. -## The Orthogonal Projection Theorem +## The orthogonal projection theorem What vector within a linear subspace of $\mathbb R^n$ best approximates a given vector in $\mathbb R^n$? The next theorem answers this question. 
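+Before stating it, a small numerical preview may help. The following sketch (the vectors and the grid are arbitrary choices made for illustration) shows that, for a one-dimensional subspace $S = \mathop{\mathrm{span}}\{s\}$, a brute-force search over $S$ picks out the same vector as the closed-form projection derived later in this lecture.
+
+```python
+import numpy as np
+
+# An arbitrary vector y and a one-dimensional subspace S = span{s} in R^3
+y = np.array([1.0, 1.0, 1.0])
+s = np.array([1.0, 2.0, 0.0])
+
+# Closed-form orthogonal projection of y onto span{s}
+y_hat = (y @ s) / (s @ s) * s
+
+# Crude grid search over elements b * s of S for the one closest to y
+grid = np.linspace(-5, 5, 100_001)
+distances = np.linalg.norm(y - grid[:, None] * s, axis=1)
+b_best = grid[np.argmin(distances)]
+
+print(y_hat)       # [0.6 1.2 0. ]
+print(b_best * s)  # (approximately) the same vector
+```
+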
-**Theorem** (OPT) Given $y \in \mathbb R^n$ and linear subspace $S \subset \mathbb R^n$, +```{prf:theorem} Orthogonal Projection Theorem +:label: opt + +Given $y \in \mathbb R^n$ and linear subspace $S \subset \mathbb R^n$, there exists a unique solution to the minimization problem $$ @@ -144,6 +147,7 @@ The minimizer $\hat y$ is the unique vector in $\mathbb R^n$ that satisfies * $y - \hat y \perp S$ The vector $\hat y$ is called the **orthogonal projection** of $y$ onto $S$. +``` The next figure provides some intuition @@ -151,7 +155,7 @@ The next figure provides some intuition ``` -### Proof of Sufficiency +### Proof of sufficiency We'll omit the full proof. @@ -171,7 +175,7 @@ $$ Hence $\| y - z \| \geq \| y - \hat y \|$, which completes the proof. -### Orthogonal Projection as a Mapping +### Orthogonal projection as a mapping For a linear space $Y$ and a fixed linear subspace $S$, we have a functional relationship @@ -179,7 +183,7 @@ $$ y \in Y\; \mapsto \text{ its orthogonal projection } \hat y \in S $$ -By the OPT, this is a well-defined mapping or *operator* from $\mathbb R^n$ to $\mathbb R^n$. +By the {prf:ref}`opt`, this is a well-defined mapping or *operator* from $\mathbb R^n$ to $\mathbb R^n$. In what follows we denote this operator by a matrix $P$ @@ -189,10 +193,11 @@ In what follows we denote this operator by a matrix $P$ The operator $P$ is called the **orthogonal projection mapping onto** $S$. ```{figure} /_static/lecture_specific/orth_proj/orth_proj_thm2.png +:scale: 65% ``` -It is immediate from the OPT that for any $y \in \mathbb R^n$ +It is immediate from the {prf:ref}`opt` that for any $y \in \mathbb R^n$ 1. $P y \in S$ and 1. $y - P y \perp S$ @@ -204,7 +209,7 @@ From this, we can deduce additional useful properties, such as For example, to prove 1, observe that $y = P y + y - P y$ and apply the Pythagorean law. -#### Orthogonal Complement +#### Orthogonal complement Let $S \subset \mathbb R^n$. @@ -224,9 +229,12 @@ such that $y = x_1 + x_2$. Moreover, $x_1 = \hat E_S y$ and $x_2 = y - \hat E_S y$. -This amounts to another version of the OPT: +This amounts to another version of the {prf:ref}`opt`: + +```{prf:theorem} Orthogonal Projection Theorem (another version) +:label: opt_another -**Theorem**. If $S$ is a linear subspace of $\mathbb R^n$, $\hat E_S y = P y$ and $\hat E_{S^{\perp}} y = M y$, then +If $S$ is a linear subspace of $\mathbb R^n$, $\hat E_S y = P y$ and $\hat E_{S^{\perp}} y = M y$, then $$ P y \perp M y @@ -234,6 +242,7 @@ P y \perp M y y = P y + M y \quad \text{for all } \, y \in \mathbb R^n $$ +``` The next figure illustrates @@ -241,7 +250,7 @@ The next figure illustrates ``` -## Orthonormal Basis +## Orthonormal basis An orthogonal set of vectors $O \subset \mathbb R^n$ is called an **orthonormal set** if $\| u \| = 1$ for all $u \in O$. @@ -281,11 +290,13 @@ $$ Combining this result with {eq}`pob` verifies the claim. -### Projection onto an Orthonormal Basis +### Projection onto an orthonormal basis When a subspace onto which we project is orthonormal, computing the projection simplifies: -**Theorem** If $\{u_1, \ldots, u_k\}$ is an orthonormal basis for $S$, then +````{prf:theorem} + +If $\{u_1, \ldots, u_k\}$ is an orthonormal basis for $S$, then ```{math} :label: exp_for_op @@ -294,14 +305,16 @@ P y = \sum_{i=1}^k \langle y, u_i \rangle u_i, \quad \forall \; y \in \mathbb R^n ``` +```` -Proof: Fix $y \in \mathbb R^n$ and let $P y$ be defined as in {eq}`exp_for_op`. 
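+Before turning to the proof, here is a quick numerical check of {eq}`exp_for_op` (an illustrative sketch; the orthonormal pair and the vector $y$ below are arbitrary choices). It confirms that the residual $y - P y$ is orthogonal to each basis vector, which is exactly what the proof establishes.
+
+```python
+import numpy as np
+
+# An orthonormal basis {u1, u2} for a two-dimensional subspace S of R^3
+u1 = np.array([1.0, 0.0, 0.0])
+u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)
+
+y = np.array([1.0, 2.0, 3.0])
+
+# P y = sum_i <y, u_i> u_i, the expression in the theorem
+P_y = (y @ u1) * u1 + (y @ u2) * u2
+
+# The residual should be orthogonal to every basis vector of S
+residual = y - P_y
+print(residual @ u1, residual @ u2)  # both (numerically) zero
+```
+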
+```{prf:proof} +Fix $y \in \mathbb{R}^n$ and let $P y$ be defined as in {eq}`exp_for_op`. Clearly, $P y \in S$. We claim that $y - P y \perp S$ also holds. -It sufficies to show that $y - P y \perp$ any basis vector $u_i$. +It suffices to show that $y - P y \perp u_i$ for any basis vector $u_i$. This is true because @@ -310,10 +323,12 @@ $$ = \langle y, u_j \rangle - \sum_{i=1}^k \langle y, u_i \rangle \langle u_i, u_j \rangle = 0 $$ +``` (Why is this sufficient to establish the claim that $y - P y \perp S$?) -## Projection Via Matrix Algebra + +## Projection via matrix algebra Let $S$ be a linear subspace of $\mathbb R^n$ and let $y \in \mathbb R^n$. @@ -323,17 +338,22 @@ $$ \hat E_S y = P y $$ -Evidently $Py$ is a linear function from $y \in \mathbb R^n$ to $P y \in \mathbb R^n$. +Evidently $Py$ is a linear function from $y \in \mathbb R^n$ to $P y \in \mathbb R^n$. [This reference](https://en.wikipedia.org/wiki/Linear_map#Matrices) is useful. -**Theorem.** Let the columns of $n \times k$ matrix $X$ form a basis of $S$. Then +```{prf:theorem} +:label: proj_matrix + +Let the columns of $n \times k$ matrix $X$ form a basis of $S$. Then $$ P = X (X'X)^{-1} X' $$ +``` -Proof: Given arbitrary $y \in \mathbb R^n$ and $P = X (X'X)^{-1} X'$, our claim is that +```{prf:proof} +Given arbitrary $y \in \mathbb R^n$ and $P = X (X'X)^{-1} X'$, our claim is that 1. $P y \in S$, and 2. $y - P y \perp S$ @@ -367,28 +387,29 @@ y] $$ The proof is now complete. +``` -### Starting with the Basis +### Starting with the basis It is common in applications to start with $n \times k$ matrix $X$ with linearly independent columns and let $$ -S := \mathop{\mathrm{span}} X := \mathop{\mathrm{span}} \{\col_1 X, \ldots, \col_k X \} +S := \mathop{\mathrm{span}} X := \mathop{\mathrm{span}} \{\mathop{\mathrm{col}}_1 X, \ldots, \mathop{\mathrm{col}}_k X \} $$ Then the columns of $X$ form a basis of $S$. -From the preceding theorem, $P = X (X' X)^{-1} X' y$ projects $y$ onto $S$. +From the {prf:ref}`proj_matrix`, $P = X (X' X)^{-1} X' y$ projects $y$ onto $S$. In this context, $P$ is often called the **projection matrix** * The matrix $M = I - P$ satisfies $M y = \hat E_{S^{\perp}} y$ and is sometimes called the **annihilator matrix**. -### The Orthonormal Case +### The orthonormal case Suppose that $U$ is $n \times k$ with orthonormal columns. -Let $u_i := \mathop{\mathrm{col}} U_i$ for each $i$, let $S := \mathop{\mathrm{span}} U$ and let $y \in \mathbb R^n$. +Let $u_i := \mathop{\mathrm{col}}_i U$ for each $i$, let $S := \mathop{\mathrm{span}} U$ and let $y \in \mathbb R^n$. We know that the projection of $y$ onto $S$ is @@ -409,13 +430,13 @@ $$ We have recovered our earlier result about projecting onto the span of an orthonormal basis. -### Application: Overdetermined Systems of Equations +### Application: overdetermined systems of equations Let $y \in \mathbb R^n$ and let $X$ be $n \times k$ with linearly independent columns. Given $X$ and $y$, we seek $b \in \mathbb R^k$ that satisfies the system of linear equations $X b = y$. -If $n > k$ (more equations than unknowns), then $b$ is said to be **overdetermined**. +If $n > k$ (more equations than unknowns), then the system is said to be **overdetermined**. Intuitively, we may not be able to find a $b$ that satisfies all $n$ equations. @@ -428,15 +449,19 @@ By approximate solution, we mean a $b \in \mathbb R^k$ such that $X b$ is close The next theorem shows that a best approximation is well defined and unique. -The proof uses the OPT. 
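+As a numerical preview of that claim, the sketch below (randomly generated data; `np.linalg.lstsq` is used only as an independent cross-check) compares the closed-form expression appearing in the theorem with NumPy's least squares routine.
+
+```python
+import numpy as np
+
+# A random overdetermined system: more equations (n) than unknowns (k)
+rng = np.random.default_rng(42)
+n, k = 10, 3
+X = rng.standard_normal((n, k))
+y = rng.standard_normal(n)
+
+# Closed-form least squares coefficients
+beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
+
+# Cross-check against NumPy's least squares routine
+beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
+
+print(np.allclose(beta_hat, beta_lstsq))  # True
+```
+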
+The proof uses the {prf:ref}`opt`. + +```{prf:theorem} -**Theorem** The unique minimizer of $\| y - X b \|$ over $b \in \mathbb R^K$ is +The unique minimizer of $\| y - X b \|$ over $b \in \mathbb R^k$ is $$ \hat \beta := (X' X)^{-1} X' y $$ +``` -Proof: Note that +```{prf:proof} +Note that $$ X \hat \beta = X (X' X)^{-1} X' y = @@ -454,20 +479,21 @@ Because $Xb \in \mathop{\mathrm{span}}(X)$ $$ \| y - X \hat \beta \| -\leq \| y - X b \| \text{ for any } b \in \mathbb R^K +\leq \| y - X b \| \text{ for any } b \in \mathbb R^k $$ This is what we aimed to show. +``` -## Least Squares Regression +## Least squares regression Let's apply the theory of orthogonal projection to least squares regression. -This approach provides insights about many geometric properties of linear regression. +This approach provides insights about many geometric properties of linear regression. We treat only some examples. -### Squared Risk Measures +### Squared risk measures Given pairs $(x, y) \in \mathbb R^K \times \mathbb R$, consider choosing $f \colon \mathbb R^K \to \mathbb R$ to minimize the **risk** @@ -592,15 +618,17 @@ Here are some more standard definitions: * The **sum of squared residuals** is $:= \| \hat u \|^2$. * The **explained sum of squares** is $:= \| \hat y \|^2$. -> TSS = ESS + SSR +$$ +\text{TSS} = \text{ESS} + \text{SSR} +$$ -We can prove this easily using the OPT. +We can prove this easily using the {prf:ref}`opt`. -From the OPT we have $y = \hat y + \hat u$ and $\hat u \perp \hat y$. +From the {prf:ref}`opt` we have $y = \hat y + \hat u$ and $\hat u \perp \hat y$. Applying the Pythagorean law completes the proof. -## Orthogonalization and Decomposition +## Orthogonalization and decomposition Let's return to the connection between linear independence and orthogonality touched on above. @@ -609,9 +637,11 @@ A result of much interest is a famous algorithm for constructing orthonormal set The next section gives details. (gram_schmidt)= -### Gram-Schmidt Orthogonalization +### Gram-Schmidt orthogonalization -**Theorem** For each linearly independent set $\{x_1, \ldots, x_k\} \subset \mathbb R^n$, there exists an +```{prf:theorem} + +For each linearly independent set $\{x_1, \ldots, x_k\} \subset \mathbb R^n$, there exists an orthonormal set $\{u_1, \ldots, u_k\}$ with $$ @@ -620,6 +650,7 @@ $$ \quad \text{for} \quad i = 1, \ldots, k $$ +``` The **Gram-Schmidt orthogonalization** procedure constructs an orthogonal set $\{ u_1, u_2, \ldots, u_n\}$. @@ -635,20 +666,23 @@ A Gram-Schmidt orthogonalization construction is a key idea behind the Kalman fi In some exercises below, you are asked to implement this algorithm and test it using projection. -### QR Decomposition +### QR decomposition The following result uses the preceding algorithm to produce a useful decomposition. 
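+Before stating it, here is a numerical glimpse of such a factorization using `scipy.linalg.qr`, which is imported at the top of this lecture (the random matrix and the `mode='economic'` argument, which returns the reduced $n \times k$ factor $Q$, are choices made for this sketch).
+
+```python
+import numpy as np
+from scipy.linalg import qr
+
+# A random n x k matrix whose columns are (almost surely) linearly independent
+rng = np.random.default_rng(0)
+X = rng.standard_normal((5, 3))
+
+# Reduced QR factorization: Q is 5 x 3, R is 3 x 3
+Q, R = qr(X, mode='economic')
+
+print(np.allclose(Q.T @ Q, np.eye(3)))  # columns of Q are orthonormal
+print(np.allclose(Q @ R, X))            # Q R reproduces X
+print(np.allclose(R, np.triu(R)))       # R is upper triangular
+```
+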
-**Theorem** If $X$ is $n \times k$ with linearly independent columns, then there exists a factorization $X = Q R$ where +```{prf:theorem} + +If $X$ is $n \times k$ with linearly independent columns, then there exists a factorization $X = Q R$ where * $R$ is $k \times k$, upper triangular, and nonsingular * $Q$ is $n \times k$ with orthonormal columns +``` -Proof sketch: Let +```{prf:proof} Let -* $x_j := \col_j (X)$ +* $x_j := \mathop{\mathrm{col}}_j (X)$ * $\{u_1, \ldots, u_k\}$ be orthonormal with the same span as $\{x_1, \ldots, x_k\}$ (to be constructed using Gram--Schmidt) -* $Q$ be formed from cols $u_i$ +* $Q$ be formed from columns $u_i$ Since $x_j \in \mathop{\mathrm{span}}\{u_1, \ldots, u_j\}$, we have @@ -658,8 +692,9 @@ x_j = \sum_{i=1}^j \langle u_i, x_j \rangle u_i $$ Some rearranging gives $X = Q R$. +``` -### Linear Regression via QR Decomposition +### Linear regression via QR decomposition For matrices $X$ and $y$ that overdetermine $\beta$ in the linear equation system $y = X \beta$, we found the least squares approximator $\hat \beta = (X' X)^{-1} X' y$. @@ -671,11 +706,12 @@ $$ \hat \beta & = (R'Q' Q R)^{-1} R' Q' y \\ & = (R' R)^{-1} R' Q' y \\ - & = R^{-1} (R')^{-1} R' Q' y - = R^{-1} Q' y + & = R^{-1} Q' y \end{aligned} $$ +where the last step uses the fact that $(R' R)^{-1} R' = R^{-1}$ since $R$ is nonsingular. + Numerical routines would in this case use the alternative form $R \hat \beta = Q' y$ and back substitution. ## Exercises @@ -713,9 +749,13 @@ intuition as to why they should be idempotent? ``` Symmetry and idempotence of $M$ and $P$ can be established -using standard rules for matrix algebra. The intuition behind +using standard rules for matrix algebra. + +The intuition behind idempotence of $M$ and $P$ is that both are orthogonal -projections. After a point is projected into a given subspace, applying +projections. + +After a point is projected into a given subspace, applying the projection again makes no difference (A point inside the subspace is not shifted by orthogonal projection onto that space because it is already the closest point in the subspace to itself). @@ -788,16 +828,16 @@ def gram_schmidt(X): U = np.empty((n, k)) I = np.eye(n) - # The first col of U is just the normalized first col of X - v1 = X[:,0] - U[:, 0] = v1 / np.sqrt(v1 @ v1) + # The first column of U is just the normalized first column of X + v1 = X[:, 0] + U[:, 0] = v1 / np.sqrt(np.sum(v1 * v1)) for i in range(1, k): # Set up b = X[:, i] # The vector we're going to project - Z = X[:, 0:i] # First i-1 columns of X + Z = X[:, :i] # First i-1 columns of X - # Project onto the orthogonal complement of the col span of Z + # Project onto the orthogonal complement of the columns span of Z M = I - Z @ np.linalg.inv(Z.T @ Z) @ Z.T u = M @ b