Writing
· 10 min read

Why You Can Start Gradient Descent Anywhere — Even on an NP-Hard Recovery Problem

With order-n complex Gaussian quadratic measurements, recovering a vector up to phase becomes both identifiable and solvable by gradient descent from any starting point — two faces of one geometric condition.

phase retrievalnonconvex optimizationquadratic feasibilitysample complexityoptimization landscape

Suppose I hand you a vector xCnx \in \mathbb{C}^n that you cannot see directly. All you get are mm numbers, each of the form Aix,x=ci\langle A_i x, x\rangle = c_i, where every AiA_i is a known Hermitian n×nn \times n matrix. Each measurement is a quadratic form — energy-like, sign-blind, and worst of all, blind to global phase: replace xx by eiφxe^{i\varphi}x and every cic_i is unchanged. Two questions hang over this immediately. First, is there even enough information to pin down xx (up to that unavoidable phase)? Second, even if there is, the obvious way to recover it is to minimize a nonconvex loss — and nonconvex least-squares over Cn\mathbb{C}^n is the kind of problem that, in general, is NP-hard. So why would any honest person expect to solve it from a cold start?

This is the quadratic feasibility problem, and it is the subject of work I led with my co-authors Gautam Dasarathy and Angelia Nedić at Arizona State. Read the paper · arXiv:2002.01066.

What is actually at stake, and what I am not claiming

Quadratic feasibility — find xx with Aix,x=ci\langle A_i x, x\rangle = c_i for all ii — is the feasibility core of a quadratically constrained quadratic program (QCQP). It is not a toy: it models power-system state estimation, x-ray crystallography, the turnpike problem, and unlabeled distance geometry. Its most famous special case is phase retrieval, where each Ai=aiaiA_i = a_i a_i^* is rank one and the measurement collapses to Aix,x=ai,x2\langle A_i x, x\rangle = |\langle a_i, x\rangle|^2 — you observe only squared magnitudes. Our setting allows arbitrary, full-rank Hermitian AiA_i.

Let me fence off what I am not doing. I will assume the AiA_i are complex Gaussian random matrices; I will assume the measurements cic_i are exact (no noise model); and I will give you a landscape result, not a finite-time convergence rate. Those are real limitations and I will come back to them. The payoff in exchange is a clean, two-part statement about when this problem is solvable at all and whether a dumb algorithm solves it.

Accept what phase retrieval already taught us — then find the crack

The "benign nonconvex landscape" program is by now a beautiful body of work. Sun, Qu, and Wright showed that the phase-retrieval 2\ell_2 loss has no spurious local minima and only strict saddles — every local min is global, every critical point that isn't a min has a direction of strict negative curvature. Ge, Jin, and Zheng gave a unified geometric analysis for low-rank problems in the same spirit. So you might think: just cite them and go home.

Here is the crack. Those results live, for the most part, in Rn\mathbb{R}^n, and the real and complex problems are not the same problem with a cosmetic change of field. In the real case, the sign ambiguity xxx \leftrightarrow -x gives you exactly two isolated global minima. In the complex case, the phase ambiguity xeiφxx \leftrightarrow e^{i\varphi}x gives you a whole continuum of equivalent global minima. A landscape proof that counts isolated minima, or that leans on real differentiability, simply does not transport. Separately, Huang, Gupta, and Dokmanić showed complex quadratic systems with full-rank random matrices are solvable by gradient descent — but only from a carefully chosen spectral initialization. The question I cared about: can you drop the clever initialization entirely?

Why the obvious recovery objective fights back

The obvious move is to minimize
f(x)=1mi=1mAix,xci2,f(x) = \frac{1}{m}\sum_{i=1}^m \big|\langle A_i x, x\rangle - c_i\big|^2,
the squared residual, with f0f \ge 0 and f=0f = 0 exactly at feasible points. Two things go wrong at once. The loss is nonconvex, so a priori there could be spurious local minima where gradient descent gets stuck. And ff is not complex-differentiable — it depends on both xx and xˉ\bar{x} — so ordinary calculus on Cn\mathbb{C}^n does not even hand us a gradient to descend on. Lurking underneath both is the prior question of whether f=0f = 0 has an essentially unique answer at all.

The one idea: measure distance the way the problem does

Here is the move the whole paper rests on. Stop measuring how far apart xx and yy are with xy\|x - y\| — that metric is offended by phase, reporting a large distance between xx and eiφxe^{i\varphi}x even though they are the same point for us. Instead use the phase-invariant distance
d(x,y)=xxyyF.d(x,y) = \|xx^* - yy^*\|_F.
Here xxxx^* is the rank-one n×nn\times n Hermitian matrix (the "lifted" version of xx), and F\|\cdot\|_F is the Frobenius norm. The single fact that makes this the right yardstick: d(x,y)=0d(x,y) = 0 if and only if x=eiφyx = e^{i\varphi}y for some phase φ\varphi. The metric is zero exactly on the orbits we cannot and should not distinguish.

The mental image to keep: lift every vector to its outer product, and do all your geometry up there, where phase has already been quotiented out. Once you are in the world of xxxx^*, "recover xx up to phase" becomes "recover the matrix xxxx^*," and the awkward symmetry evaporates.

The small algebraic engine that powers everything is a polarization identity. Write u=xeiφyu = x - e^{i\varphi} y and v=x+eiφyv = x + e^{i\varphi} y, and define the symmetric outer product [ ⁣[u,v] ⁣]=12(uv+vu)[\![u,v]\!] = \tfrac12(uv^* + vu^*). Then
xxyy=[ ⁣[u,v] ⁣].xx^* - yy^* = [\![u,v]\!].
This converts "tell xx and yy apart" into "the AiA_i must not crush the rank-2 Hermitian matrix [ ⁣[u,v] ⁣][\![u,v]\!]." That is a statement about how a linear map — the AiA_i acting on matrices — treats low-rank matrices, exactly the territory where RIP-style isometry arguments live.

The smallest nontrivial case, with numbers

Let me make this concrete in C2\mathbb{C}^2. Take x=(1,0)x = (1,0) and y=(0,1)y = (0,1) — orthogonal unit vectors, clearly not equal up to any phase. Lift them:
xx=(1000),yy=(0001),xxyy=(1001).xx^* = \begin{pmatrix}1&0\\0&0\end{pmatrix},\quad yy^* = \begin{pmatrix}0&0\\0&1\end{pmatrix},\quad xx^*-yy^* = \begin{pmatrix}1&0\\0&-1\end{pmatrix}.
So d(x,y)2=xxyyF2=12+(1)2=2d(x,y)^2 = \|xx^*-yy^*\|_F^2 = 1^2 + (-1)^2 = 2, i.e. d(x,y)=2d(x,y)=\sqrt2. Now check the engine. With φ=0\varphi=0, u=xy=(1,1)u = x-y = (1,-1) and v=x+y=(1,1)v = x+y = (1,1), so
uv=(1111),vu=(1111),12(uv+vu)=(1001).uv^* = \begin{pmatrix}1&1\\-1&-1\end{pmatrix},\quad vu^* = \begin{pmatrix}1&-1\\1&-1\end{pmatrix},\quad \tfrac12(uv^*+vu^*) = \begin{pmatrix}1&0\\0&-1\end{pmatrix}.
That is exactly xxyyxx^*-yy^*. The identity is not symbol-pushing; it reproduces the matrix entrywise.

Now watch a single random measurement act on this. For a complex Gaussian Hermitian AA (real N(0,1)N(0,1) diagonal, complex N(0,1)N(0,1) off-diagonal fixed by symmetry), the inner product against our diagonal real M=xxyyM = xx^*-yy^* collapses to A,M=A11A22\langle A, M\rangle = A_{11} - A_{22} — a difference of two independent standard normals. Its expected square is Var(A11)+Var(A22)=1+1=2=d(x,y)2\operatorname{Var}(A_{11}) + \operatorname{Var}(A_{22}) = 1 + 1 = 2 = d(x,y)^2. So one random quadratic measurement is an unbiased estimate of the squared phase-invariant distance. I averaged A,M2|\langle A,M\rangle|^2 over 2×1052\times10^5 random Hermitian Gaussians and got 1.991.99, right on top of 22. Average enough of them and the estimate concentrates — the only remaining question is how many "enough" is.

From unbiasedness to identifiability: an isometry

Call the measurement map MA(x)=(A1x,x,,Amx,x)M_A(x) = (\langle A_1 x, x\rangle, \dots, \langle A_m x, x\rangle). Say MAM_A is (α,β)(\alpha,\beta)-stable if
αd(x1,x2)    MA(x1)MA(x2)2    βd(x1,x2).\alpha\, d(x_1,x_2) \;\le\; \|M_A(x_1) - M_A(x_2)\|_2 \;\le\; \beta\, d(x_1,x_2).
The structural fact, via Wang and Xu's phase-retrievability characterization: MAM_A is injective up to phase if and only if it is (α,β)(\alpha,\beta)-stable. Identifiability is a RIP-like sandwich in the metric dd — not analogous to one, literally the same statement. (The bookkeeping that turns injectivity into the two-sided bound on iAi,[ ⁣[u,v] ⁣]2\sum_i |\langle A_i, [\![u,v]\!]\rangle|^2 is in the paper.)

To earn that sandwich for Gaussian AiA_i, combine the unbiasedness above with a concentration bound: the empirical average 1miAi,xxyy2\tfrac1m\sum_i |\langle A_i, xx^*-yy^*\rangle|^2 stays within εd(x,y)2\varepsilon\, d(x,y)^2 of its mean except with probability decmε\le d\,e^{-cm\varepsilon}. That controls one pair (x,y)(x,y). To make it hold uniformly over the sphere, cover the sphere with a δ\delta-net of size Nδ(12/δ)n|N_\delta| \le (12/\delta)^n and take a union bound. This is where the sample complexity is born: the union bound multiplies the per-point failure decmεd\,e^{-cm\varepsilon} by the net size (12/δ)n(12/\delta)^n, so the exponent must beat the net, cmεnlog(12/δ)cm\varepsilon \gtrsim n\log(12/\delta). The nn sitting in the net's exponent is exactly what forces mnm \gtrsim n.

Theorem 1 (near-isometry / identifiability). If the Hermitian AiA_i are complex Gaussian and m>Cnm > Cn, then for any ξ(0,1)\xi \in (0,1), with probability 1ξ\ge 1-\xi,   αd(x,y)MA(x)MA(y)2βd(x,y)\;\alpha\, d(x,y) \le \|M_A(x)-M_A(y)\|_2 \le \beta\, d(x,y) for all x,yCnx,y \in \mathbb{C}^n, with explicit α=(12δ)2(1ε)(1+2δ)2\alpha = \tfrac{(1-2\delta)^2(1-\varepsilon)}{(1+2\delta)^2} and β=(1+2δ)2(1+ε)(12δ)2\beta = \tfrac{(1+2\delta)^2(1+\varepsilon)}{(1-2\delta)^2}. Distinct inputs give distinct measurements: the model is identifiable.

The same condition makes the landscape benign

Now the second face. Because ff is not holomorphic, we differentiate with Wirtinger calculus — treat xx and xˉ\bar x as independent variables. The goal is to rule out spurious minima, and the technique is to exhibit strict negative curvature everywhere a non-solution could hide: at every point xx with small gradient there is a direction Δ\Delta with
2f(x)Δ,Δ    c0xxzzF2  <  0whenever d(x,z)>0,\langle \nabla^2 f(x)\, \Delta, \Delta\rangle \;\le\; -c_0\, \|xx^* - zz^*\|_F^2 \;<\; 0 \quad\text{whenever } d(x,z) > 0,
where zz is the ground truth generating the cic_i. Notice the right-hand side is c0d(x,z)2-c_0\, d(x,z)^2 — the same phase-invariant distance, again doing the bookkeeping. Strict negative curvature at every non-optimal stationary point means no flat traps.

Theorem 2 (benign landscape). For complex Gaussian AiA_i with m>Cnm > Cn, with probability 1ξ\ge 1-\xi: ff is strict-saddle, and every local minimum ww satisfies d(w,z)=0d(w,z)=0. No spurious local minima; every local min is global up to phase. Corollary. A gradient method from an arbitrary initialization therefore converges to a global minimum, via the saddle-avoidance results of Lee, Simchowitz, Jordan, and Recht.

The one sentence to keep for a week: identifiability and tractable optimization are governed by one and the same geometric condition — the (α,β)(\alpha,\beta)-stability that complex Gaussian measurements buy you at mnm \gtrsim n.

What the result really says, and where it goes vacuous

The defended claim, with its boundary: order-nn Gaussian quadratic measurements suffice for both identifiability and a globally optimizable landscape, with no clever initialization. That last clause is the sharp contrast with Huang–Gupta–Dokmanić, who needed a spectral start.

Now the honest limits. The guarantees are for complex Gaussian AiA_i — not structured, deterministic, or coded measurements. The constants (CC, α\alpha, β\beta, c0c_0) are existence-type and unoptimized; m>Cnm > Cn is order-nn, not a sharp threshold, and pushing δ,ε\delta,\varepsilon smaller (stronger stability) inflates CC — a real tradeoff. The corollary gives asymptotic saddle-avoidance, not a finite-time rate. There is no noise model: cic_i are exact. And to be candid, this manuscript is theory-focused; I would not have you take a benchmark number from me here.

The load-bearing assumption is the Gaussianity, and it is exactly where the result threatens to go vacuous: every "with high probability" rides on the second-moment identity EAi,xxyy2=d(x,y)2\mathbb{E}|\langle A_i, xx^*-yy^*\rangle|^2 = d(x,y)^2 and on the concentration bound above. Break those — heavy tails, strong structure, adversarial AiA_i — and neither the isometry nor the curvature lower bound is guaranteed.

What I would do next

The cleanest open question: how far can the Gaussian assumption be relaxed — to sub-Gaussian, coded-diffraction, or sparse AiA_i — while keeping the landscape benign? After that: the tight sample-complexity constant (is mnm \asymp n optimal?); finite-time iteration complexity to replace the qualitative convergence; a noise/robustness model with stable recovery bounds; and exploiting structure (sparse or low-rank xx). If you take one thing away, let it be the reframing: lift to xxxx^*, measure with dd, and identifiability and optimizability turn out to be the same fact seen twice.