Posts tagged #ml-research
-
The Five-Minute Daily Drift Check
Solo research programs drift one small exception at a time. Three shell commands, run daily, catch the most common protocol violations before they compound into reproducibility failures.
-
Theorem-Screened Experiments
A three-step decision rule for running fewer, better experiments: check your theorem library before you touch a GPU. Calibration checks, falsifier traps, and parameter compression from the physicist mode of ML research.
-
Naming What Fails: The Obstacle Taxonomy
25+ pre-registered kills over six weeks of compression research. The tempting story is "compression is hard." The physicist story is better: 25 kills, ~10 structural failure patterns, one Lean theorem per class.
-
Two Research Modes, and Why the Second One Needs Lean 4
AI makes hypothesis generation cheap. Evaluation stays expensive. Lean 4 proofs are the filter that changes the economics: a proved theorem screens an entire family of candidates before GPU time is allocated.
-
Zero-Sorry Discipline: What a Lean 4 Appendix Actually Costs
Two theorems in this paper — MoEGauge and JensenFloor — had to reach zero sorries before the paper shipped. What that process looks like, why sorry is dangerous, and what JensenFloor actually says.
-
What Experimental Design Actually Means
Theoretical physicists barely need it. Experimental physicists cannot live without it. Life sciences rewrote it for complexity. Pharma made it law. ML borrows the wrong one.
-
A Survey as a Living Document
What it means to maintain a formal proof corpus that stays in sync with its own coverage badge, and what happened when RWKV was added as a new architectural family without invalidating existing theorems.
-
How to Honestly Test if a Neural Network Can Be Compressed
Pre-registration, trap cells, τ-hardened baselines, and kill-fast protocols: a field methodology for compression research that tries to kill its own ideas. With actual results from OLMoE-1B-7B.
-
Hypothesis Testing from Scratch, and Its Bayesian Analogue
Frequentist hypothesis testing rebuilt from first principles for ML researchers who half-remember p-values. Then: the Bayesian reframe, why it fits the kill-ladder better, and what each one actually buys you.
-
A Catalogue of Symmetries Compression Must Respect
Compression schemes regularly violate algebraic invariants of weight structure, producing models that pass perplexity checks but fail downstream. Here are the five core symmetry types a formally verified survey is collecting.
-
Part E Pivot: FFN Rotation and the Narrow-d Falsification
After the KV-cache gauge, the obvious next move was applying β-lift to FFN weights. We tested it. It failed. Here is what the RAdam convergence probe and the 1-bit generation test actually showed.
-
The Microsite as Interactive Publication
Building a GitHub Pages research microsite with d3 widgets, a Lean 4 theorem status page, and a reproducibility shim. One gotcha with Jekyll and markdown-inside-divs. One real answer to whether live widgets are worth the effort.
-
Phase-Collapse Defragmentation: Why MoE KV-Cache Resists 1-bit Quantization
Attention head activations in Mixture-of-Experts models cluster around expert routing patterns. Quantizing the KV-cache destroys this signal. The MoEGauge framework builds provable bounds on exactly how much.
-
Adversarial Passes That Killed Claims
Two hypotheses that started as clean ideas and ended as documented failures: the DPO-CLM orthogonal-complement hypothesis, and the cross-probe srank claim retracted as a length-bias artifact.
-
Pre-Registration for Solo ML Researchers
How to borrow the clinical trial discipline of writing down what "pass" looks like before running the experiment — and why a SHA256 hash is the cheapest honesty enforcement mechanism available.
-
The Manuscript-as-Codebase Pattern
Hierarchical Makefiles, data-driven macro generation, and paper-scoped .gitignore: how treating a research paper like a software project caught hardcoded inconsistencies before reviewers did.
-
Stable Rank as an Overfitting Signature in LoRA Fine-Tuning
Why we picked stable rank to detect overfitting geometry in DPO vs CLM fine-tuning, how it connects to "alignment geometry," and what the BitFit baseline was there to check.