Method Summary

This page records the implementation-level understanding used to document the dp5.nmr_processing package. It is based on the DP4-AI paper, "DP4-AI automated NMR data analysis: straight from spectrometer to structure" (Chem. Sci., 2020, 11, 4351-4359), together with the sections of its supplementary information that cover automated NMR processing and assignment.

Paper Context

DP4-AI was introduced to remove the most labour-intensive part of classical DP4 analysis: manual extraction of peak positions, integrals, and assignments from 1D proton and carbon NMR spectra. The method couples two ideas:

robust automated processing of raw spectra,
probabilistic matching of calculated GIAO shifts to experimental peaks.

The paper’s central claim is not only that these steps can be automated, but that they can be automated without materially reducing stereochemical assignment performance relative to expert-curated DP4 input files.

In the codebase, that automation is concentrated in dp5.nmr_processing. The package provides the experimental side of the workflow, while dp5.dft and dp5.analysis consume the processed peak lists and calculated shifts.

Implementation Map

The current implementation divides the workflow into four layers.

Input handling

The dp5.nmr_processing.nmr_ai.NMRData class is the orchestration layer. It detects whether the user supplied Bruker data, JCAMP-DX data, or a manual description file. Raw FID inputs are converted into normalised frequency-domain spectra by dp5.nmr_processing.helper_functions, while text descriptions are parsed by dp5.nmr_processing.description_files.
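As a rough illustration of the kind of input detection described above, the sketch below classifies a path using only well-known file conventions: a Bruker 1D experiment directory contains fid and acqus files, and a JCAMP-DX file starts with a ##TITLE= record. The function name detect_input_kind and the returned labels are hypothetical; the checks NMRData actually performs may differ.

```python
from pathlib import Path


def detect_input_kind(path):
    """Classify an NMR input path (hypothetical helper, not the dp5 API)."""
    path = Path(path)
    if path.is_dir():
        # Bruker 1D experiments store the raw FID and acquisition parameters
        # in files named "fid" and "acqus".
        if (path / "fid").is_file() and (path / "acqus").is_file():
            return "bruker"
        raise ValueError(f"{path} is a directory without Bruker fid/acqus files")
    if path.suffix.lower() in {".jdx", ".dx", ".jcamp"}:
        return "jcamp"
    # Fall back to sniffing the first non-blank line: JCAMP-DX files open
    # with a ##TITLE= labelled data record.
    with path.open(errors="ignore") as handle:
        for line in handle:
            if line.strip():
                if line.lstrip().startswith("##TITLE="):
                    return "jcamp"
                break
    # Assumption: anything else is treated as a manual description file.
    return "description"
```

Treating the description file as the fallback keeps the sketch short; a stricter version would validate the description syntax before accepting it.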

Proton pipeline

The proton path follows the DP4-AI methodology most closely.

The FID is Fourier transformed, phased, baseline corrected, and analysed to estimate a correlation distance and noise level.
Candidate peaks are identified from minima in the second derivative of the spectrum, using a deliberately low threshold so that weak genuine signals are not missed.
Nearby peaks are grouped into provisional multiplets and each region is fit with a Pearson-VII line-shape model.
Individual component peaks are removed if doing so lowers the Bayesian Information Criterion enough to justify the simpler model.
Known solvent multiplets are identified from a solvent database, removed, and used to reference the spectrum.
The fitted line-shape model is integrated, a plausible total proton count is selected by maximising an integer-likeness score, and low-integral regions are removed as impurities.
The final multiplet centres and rounded integrals are passed to the proton assignment algorithm.
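The second-derivative peak picking step in the list above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name and the threshold parameter (expressed directly in second-derivative units) are assumptions, and the real code derives its threshold from the estimated noise level.

```python
import numpy as np


def pick_peaks_second_derivative(spectrum, threshold):
    """Return indices of local minima of the second derivative below -threshold.

    Hypothetical helper: `threshold` is given in second-derivative units.
    """
    spectrum = np.asarray(spectrum, dtype=float)
    # Numerical second derivative via two passes of central differences.
    d2 = np.gradient(np.gradient(spectrum))
    # A point is a local minimum if it is lower than both neighbours.
    is_min = np.zeros(d2.shape, dtype=bool)
    is_min[1:-1] = (d2[1:-1] < d2[:-2]) & (d2[1:-1] < d2[2:])
    # Keep only minima that are clearly negative, i.e. sharp peak tops.
    return np.flatnonzero(is_min & (d2 < -threshold))
```

A low threshold deliberately over-picks; the later model-selection step is what removes the spurious candidates.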

This logic is implemented mainly in dp5.nmr_processing.proton.process.proton_processing(), dp5.nmr_processing.proton.bic_minimisation.multiproc_BIC_minimisation(), and dp5.nmr_processing.proton.assign.iterative_assignment().
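To make the line-shape fitting and BIC-based pruning concrete, here is a simplified sketch. It assumes a unit-height Pearson VII shape with a fixed half width and shape parameter, fits only the amplitudes by linear least squares, and uses the least-squares form of the BIC, n ln(RSS/n) + k ln(n). The function names, the fixed-width simplification, and the greedy removal order are all assumptions made for illustration; multiproc_BIC_minimisation() may parameterise and optimise the model differently.

```python
import numpy as np


def pearson_vii(x, centre, hwhm, m):
    """Unit-height Pearson VII; m = 1 is Lorentzian, large m approaches Gaussian."""
    return (1.0 + ((x - centre) / hwhm) ** 2 * (2.0 ** (1.0 / m) - 1.0)) ** (-m)


def bic(residuals, n_params):
    """Least-squares BIC: n * ln(RSS / n) + k * ln(n)."""
    n = residuals.size
    rss = float(np.sum(residuals ** 2))
    return n * np.log(rss / n) + n_params * np.log(n)


def fit_amplitudes(x, y, centres, hwhm, m):
    """Fit one amplitude per component with centres and shape held fixed."""
    basis = np.column_stack([pearson_vii(x, c, hwhm, m) for c in centres])
    amplitudes, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return amplitudes, y - basis @ amplitudes


def prune_peaks(x, y, centres, hwhm=0.01, m=2.0):
    """Greedily drop components while doing so lowers the BIC."""
    centres = list(centres)
    _, residuals = fit_amplitudes(x, y, centres, hwhm, m)
    best = bic(residuals, len(centres))
    while len(centres) > 1:
        trials = []
        for i in range(len(centres)):
            reduced = centres[:i] + centres[i + 1:]
            _, res = fit_amplitudes(x, y, reduced, hwhm, m)
            trials.append((bic(res, len(reduced)), reduced))
        trial_bic, trial_centres = min(trials, key=lambda item: item[0])
        if trial_bic >= best:
            break  # every removal makes the model worse; stop pruning
        best, centres = trial_bic, trial_centres
    return centres
```

Removing a component that explains real signal raises the residual sum of squares sharply and is rejected, whereas removing a component that fits only noise saves the ln(n) penalty at almost no cost in fit quality.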

Carbon pipeline

The carbon path is simpler because carbon integrals do not constrain the assignment in the way proton integrals do.

The spectrum is corrected and its noisy edges are zeroed.
An iterative peak-picking loop repeatedly fits the most intense remaining peak and keeps any maxima that still rise sufficiently above the fitted background.
Solvent peaks are removed using expected solvent patterns.
The assignment step uses peak position and amplitude together, because carbon spectra often contain low-intensity noise peaks and may have fewer experimental peaks than atoms.

This logic is implemented mainly in dp5.nmr_processing.carbon.process.carbon_processing() and dp5.nmr_processing.carbon.assign.iterative_assignment().
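The iterative peak-picking idea can be illustrated with a heavily simplified loop. In this sketch the "fit" is replaced by subtracting a Lorentzian of fixed, user-supplied half width at the current maximum, and the stopping rule is a plain height cutoff. Both simplifications, and the names iterative_peak_pick, hwhm_points, and min_height, are assumptions; carbon_processing() fits each peak and compares the remaining maxima against the fitted background.

```python
import numpy as np


def iterative_peak_pick(spectrum, hwhm_points, min_height, max_peaks=100):
    """Repeatedly take the tallest remaining maximum and subtract a peak model.

    Hypothetical sketch: a fixed-width Lorentzian stands in for a fitted peak.
    """
    residual = np.asarray(spectrum, dtype=float).copy()
    x = np.arange(residual.size)
    picked = []
    for _ in range(max_peaks):
        idx = int(np.argmax(residual))
        height = residual[idx]
        if height < min_height:
            break  # nothing left rises sufficiently above the background
        picked.append(idx)
        # Remove the modelled peak so that weaker neighbours become visible.
        residual -= height / (1.0 + ((x - idx) / hwhm_points) ** 2)
    return sorted(picked)
```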

Assignment Logic

Shared idea

Both nuclei use an assignment probability matrix M, whose element M[i, j] measures how plausible it is that calculated shift i corresponds to experimental peak j. The final assignment is obtained by solving the resulting linear sum assignment problem with the Hungarian algorithm, which is the same core strategy described in the paper.
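A minimal sketch of this shared idea, using scipy.optimize.linear_sum_assignment, is shown below. The Gaussian form of the probability and the sigma value are illustrative assumptions; the package builds its matrices from its own error model.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_shifts(calc, exp, sigma=0.3):
    """Assign calculated shifts to experimental peaks.

    `sigma` is a hypothetical error width, not a value taken from the package.
    """
    calc = np.asarray(calc, dtype=float)
    exp = np.asarray(exp, dtype=float)
    # M[i, j]: plausibility that calculated shift i matches experimental peak j.
    prob = np.exp(-0.5 * ((calc[:, None] - exp[None, :]) / sigma) ** 2)
    # linear_sum_assignment minimises cost, so negate to maximise probability.
    rows, cols = linear_sum_assignment(-prob)
    return rows, cols
```

The matrix may be rectangular: with more experimental peaks than calculated shifts, the surplus peaks are simply left unassigned.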

External and internal scaling

The paper emphasises that internal DP4-style scaling cannot be used at the start because the assignments are not known yet. The code therefore performs an initial pass with fixed empirical external scaling factors and then recalculates the scaling relation from the provisional assignment. This two-stage logic is present in both the proton and carbon assignment modules.
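The two-stage logic can be sketched as follows. The external slope and intercept are passed in as arguments because the empirical factors used by the package are not reproduced here; the Gaussian probability, the sigma value, and the function names are likewise assumptions. The sketch also assumes there are at least as many experimental peaks as calculated shifts, so that every calculated shift receives a provisional partner.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def scale(calc, slope, intercept):
    """DP4-style linear scaling of calculated shifts."""
    return (calc - intercept) / slope


def assign(calc_scaled, exp, sigma):
    """Return, for each calculated shift, the index of its assigned peak."""
    prob = np.exp(-0.5 * ((calc_scaled[:, None] - exp[None, :]) / sigma) ** 2)
    _, cols = linear_sum_assignment(-prob)
    return cols


def two_stage_assignment(calc, exp, ext_slope, ext_intercept, sigma=0.3):
    """External scaling first, then internal scaling from the provisional pairs."""
    calc = np.asarray(calc, dtype=float)
    exp = np.asarray(exp, dtype=float)
    # Stage 1: fixed external factors (placeholders supplied by the caller).
    first = assign(scale(calc, ext_slope, ext_intercept), exp, sigma)
    # Stage 2: regress calculated against assigned experimental shifts.
    slope, intercept = np.polyfit(exp[first], calc, 1)
    second = assign(scale(calc, slope, intercept), exp, sigma)
    return second, slope, intercept
```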

Proton-specific constraints

The proton assignment implementation uses integral information explicitly. Expanded multiplet centres allow a peak to be assigned as many times as its rounded integral permits. Methyl protons are handled first as grouped units, which reflects the paper’s observation that methyl groups behave as equivalent signals and should be assigned to the same peak before the remaining protons are optimised.
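The sketch below illustrates both constraints under simplifying assumptions. Each multiplet centre is repeated once per unit of remaining integral so that the Hungarian step can use it that many times, and methyl groups are placed first as units. The greedy methyl placement, the Gaussian probability, sigma, and the function name assign_protons are illustrative choices, not the package's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_protons(calc, methyl_groups, centres, integrals, sigma=0.3):
    """Map each calculated proton index to a multiplet index (hypothetical sketch)."""
    calc = np.asarray(calc, dtype=float)
    centres = np.asarray(centres, dtype=float)
    capacity = np.array(integrals, dtype=int)
    assignment = {}
    # Methyl protons first: all members of a group share one multiplet.
    for group in methyl_groups:
        group = list(group)
        if not np.any(capacity >= len(group)):
            raise ValueError("no multiplet has enough integral for a methyl group")
        mean_shift = calc[group].mean()
        prob = np.exp(-0.5 * ((mean_shift - centres) / sigma) ** 2)
        prob[capacity < len(group)] = -1.0  # multiplet cannot host the group
        peak = int(np.argmax(prob))
        for atom in group:
            assignment[atom] = peak
        capacity[peak] -= len(group)
    # Remaining protons: expand each centre once per unit of spare integral.
    remaining = [i for i in range(calc.size) if i not in assignment]
    column_peak = np.repeat(np.arange(centres.size), capacity)
    expanded = centres[column_peak]
    prob = np.exp(-0.5 * ((calc[remaining][:, None] - expanded[None, :]) / sigma) ** 2)
    rows, cols = linear_sum_assignment(-prob)
    for r, c in zip(rows, cols):
        assignment[remaining[r]] = int(column_peak[c])
    return assignment
```

Expanding columns rather than rows keeps the problem a standard one-to-one assignment while still letting a multiplet with integral 2 receive two protons.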

Carbon-specific weighting

The carbon assignment implementation follows the DP4-AI strategy of using peak amplitudes to distinguish more reliable signal peaks from likely noise. A KDE of peak heights is used to derive amplitude groups and weights. The code then duplicates assignment columns so that one experimental peak may explain multiple equivalent carbons, while progressively penalising repeated use of the same peak. After two assignment rounds, a bias-driven reassignment step checks for cases where a nearby intense unassigned peak is chemically more plausible than a weak peak chosen in the initial optimisation.
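The column-duplication idea can be sketched as below. For brevity, normalised peak heights stand in for the KDE-derived amplitude weights, and the bias-driven reassignment step is omitted; the penalty factor, sigma, the number of repeats, and the function name are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_carbons(calc, peaks, heights, sigma=2.0, max_repeats=2, repeat_penalty=0.5):
    """Amplitude-weighted assignment that allows penalised peak reuse.

    Hypothetical sketch: weights are plain normalised heights, not KDE groups.
    """
    calc = np.asarray(calc, dtype=float)
    peaks = np.asarray(peaks, dtype=float)
    heights = np.asarray(heights, dtype=float)
    weights = heights / heights.max()
    # Position term times amplitude weight: weak peaks are less attractive.
    base = np.exp(-0.5 * ((calc[:, None] - peaks[None, :]) / sigma) ** 2) * weights[None, :]
    # Duplicate the columns; the k-th copy of a peak is penalised by penalty**k.
    expanded = np.hstack([base * repeat_penalty ** k for k in range(max_repeats)])
    _, cols = linear_sum_assignment(-expanded)
    # Map duplicated columns back to the original peak indices.
    return cols % peaks.size
```

With these settings, two near-equivalent carbons will usually share one intense peak rather than have one of them pushed onto a weak neighbouring peak that is more likely to be noise.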

Why the Proton Pipeline Is More Elaborate

The supplementary methods make clear that automated proton processing is harder than automated carbon processing because the final data product is not just a list of peak positions. The pipeline must also recover multiplet boundaries, remove over-picked noise peaks, estimate integer-like integrals, and preserve equivalence information needed by the proton assignment algorithm.

That distinction is visible in the code:

proton processing has dedicated modules for gradient peak picking, BIC-based deconvolution, analytic integration, and impurity filtering,
carbon processing delegates more of the complexity to the assignment stage, where amplitude weighting and repeat assignments compensate for noisier peak lists.
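As an illustration of the integer-like integral estimation mentioned above, the following sketch scores each candidate total proton count by how close the rescaled integrals come to whole numbers, then rounds the integrals and drops low-integral regions. The scoring function (negative mean distance to the nearest integer), the 0.5 impurity cutoff, and the function names are assumptions; the package's score may be defined differently.

```python
import numpy as np


def integer_likeness(integrals, total_protons):
    """Score how integer-like the integrals are when scaled to a given total."""
    integrals = np.asarray(integrals, dtype=float)
    scaled = integrals * total_protons / integrals.sum()
    # Higher is better: 0 means every scaled integral is a whole number.
    score = -float(np.mean(np.abs(scaled - np.round(scaled))))
    return score, scaled


def choose_proton_count(integrals, candidate_totals):
    """Pick the candidate total with the highest score (first one wins ties)."""
    scored = [(integer_likeness(integrals, n)[0], n) for n in candidate_totals]
    _, best_total = max(scored, key=lambda item: item[0])
    return best_total


def rounded_integrals(integrals, total_protons, impurity_cutoff=0.5):
    """Drop regions below the (hypothetical) cutoff and round the rest."""
    _, scaled = integer_likeness(integrals, total_protons)
    keep = scaled >= impurity_cutoff
    return keep, np.round(scaled[keep]).astype(int)
```

Restricting the candidate totals to counts that are chemically plausible for the proposed structure matters, because any multiple of a perfect solution scores equally well.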

Relationship to Legacy Description Files

The package still supports hand-written NMR description files because DP4-style workflows pre-date automated raw-data processing. When a description file is used, the code bypasses the FID pipelines and instead parses manually supplied shifts, equivalence groups, and omitted atoms. This is why dp5.nmr_processing.description_files remains part of the public package: it preserves compatibility with curated inputs, while the automated pipelines are used whenever raw spectra are available.

Practical Reading Guide

For code readers, the most useful entry points are:

dp5.nmr_processing.nmr_ai.NMRData for orchestration,
dp5.nmr_processing.proton.process.proton_processing() for the proton pipeline,
dp5.nmr_processing.carbon.process.carbon_processing() for the carbon pipeline,
dp5.nmr_processing.proton.assign.iterative_assignment() for integral-constrained proton assignment,
dp5.nmr_processing.carbon.assign.iterative_assignment() for amplitude-aware carbon assignment.

The API reference page links these functions directly to their source docstrings.