Method Summary
This page records the implementation-level understanding used to document the
dp5.nmr_processing package. It is based on the DP4-AI paper,
"DP4-AI automated NMR data analysis: straight from spectrometer to structure"
(Chem. Sci., 2020, 11, 4351-4359), together with the supplementary information
sections on automated NMR processing and assignment.
Paper Context
DP4-AI was introduced to remove the most labour-intensive part of classical DP4 analysis: manual extraction of peak positions, integrals, and assignments from 1D proton and carbon NMR spectra. The method couples two ideas:
robust automated processing of raw spectra,
probabilistic matching of calculated GIAO shifts to experimental peaks.
The paper’s central claim is not only that these steps can be automated, but that they can be automated without materially reducing stereochemical assignment performance relative to expert-curated DP4 input files.
In the codebase, that automation is concentrated in
dp5.nmr_processing. The package provides the experimental side of the
workflow, while dp5.dft and dp5.analysis consume the processed
peak lists and calculated shifts.
Implementation Map
The current implementation divides the workflow into four layers.
Input handling
The dp5.nmr_processing.nmr_ai.NMRData class is the orchestration layer.
It detects whether the user supplied Bruker data, JCAMP-DX data, or a manual
description file. Raw FID inputs are converted into normalised frequency-domain
spectra by dp5.nmr_processing.helper_functions, while text descriptions
are parsed by dp5.nmr_processing.description_files.
Proton pipeline
The proton path follows the DP4-AI methodology most closely.
The FID is Fourier transformed, phased, baseline corrected, and analysed to estimate a correlation distance and noise level.
Candidate peaks are identified from minima in the second derivative, using a deliberately low threshold so that weak signal peaks are not missed.
Nearby peaks are grouped into provisional multiplets and each region is fit with a Pearson-VII line-shape model.
Individual component peaks are removed if doing so lowers the Bayesian Information Criterion enough to justify the simpler model.
Known solvent multiplets are identified from a solvent database, removed, and used to reference the spectrum.
The fitted line-shape model is integrated, a plausible total proton count is selected by maximising an integer-likeness score, and low-integral regions are removed as impurities.
The final multiplet centres and rounded integrals are passed to the proton assignment algorithm.
This logic is implemented mainly in
dp5.nmr_processing.proton.process.proton_processing(),
dp5.nmr_processing.proton.bic_minimisation.multiproc_BIC_minimisation(),
and dp5.nmr_processing.proton.assign.iterative_assignment().
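Three of the steps above (second-derivative peak picking, the Pearson-VII line shape, and the BIC test for dropping a component) can be illustrated with a minimal standalone sketch. It is not the dp5 implementation: the function names, the finite-difference derivative, the least-squares form of the BIC, and the assumption of four parameters per peak are all chosen here for illustration, and no actual fitting is performed.

```python
import math

def pearson_vii(x, amplitude, centre, hwhm, shape):
    """Pearson VII line shape: shape=1 is Lorentzian, large shape approaches Gaussian."""
    scaled = ((x - centre) / hwhm) ** 2 * (2.0 ** (1.0 / shape) - 1.0)
    return amplitude / (1.0 + scaled) ** shape

def second_derivative(y):
    """Central finite-difference second derivative; end points are left at zero."""
    d2 = [0.0] * len(y)
    for i in range(1, len(y) - 1):
        d2[i] = y[i - 1] - 2.0 * y[i] + y[i + 1]
    return d2

def pick_peaks(y, threshold):
    """Indices where the second derivative has a local minimum below -threshold."""
    d2 = second_derivative(y)
    return [
        i for i in range(1, len(y) - 1)
        if d2[i] < -threshold and d2[i] <= d2[i - 1] and d2[i] <= d2[i + 1]
    ]

def bic(rss, n_points, n_params):
    """BIC for a least-squares fit with Gaussian errors (assumed form)."""
    return n_points * math.log(rss / n_points) + n_params * math.log(n_points)

def should_drop_component(rss_full, rss_reduced, n_points, n_peaks, params_per_peak=4):
    """True if the model with one peak fewer has the lower BIC.

    params_per_peak=4 (amplitude, centre, width, shape) is an assumption.
    """
    full = bic(rss_full, n_points, n_peaks * params_per_peak)
    reduced = bic(rss_reduced, n_points, (n_peaks - 1) * params_per_peak)
    return reduced < full

# Synthetic noiseless spectrum with two peaks, for demonstration only.
x = [i * 0.01 for i in range(400)]
y = [pearson_vii(v, 1.0, 1.0, 0.05, 2.0) + pearson_vii(v, 0.5, 2.5, 0.05, 2.0) for v in x]
print(pick_peaks(y, 1e-3))
```

A low threshold in pick_peaks deliberately over-picks on noisy data; the BIC comparison is then what prunes components that do not improve the fit enough to justify their extra parameters.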
Carbon pipeline
The carbon path is simpler because integrals are not assignment constraints in the same way they are for proton spectra.
The spectrum is corrected and its noisy edges are zeroed.
An iterative peak-picking loop repeatedly fits the most intense remaining peak and keeps any maxima that still rise sufficiently above the fitted background.
Solvent peaks are removed using expected solvent patterns.
The assignment step uses peak position and amplitude together, because carbon spectra often contain low-intensity noise peaks and may have fewer experimental peaks than atoms.
This logic is implemented mainly in
dp5.nmr_processing.carbon.process.carbon_processing() and
dp5.nmr_processing.carbon.assign.iterative_assignment().
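The iterative peak-picking idea can be sketched as a loop that removes the tallest remaining feature until nothing rises above a threshold. This is a simplified illustration, not the dp5 code: subtracting a fixed-width Lorentzian stands in for the real fitting step, and the names iterative_peak_pick, threshold, and hwhm are hypothetical.

```python
def lorentzian(i, centre, amplitude, hwhm):
    """Lorentzian line evaluated at point index i."""
    return amplitude / (1.0 + ((i - centre) / hwhm) ** 2)

def iterative_peak_pick(spectrum, threshold, hwhm, max_peaks=50):
    """Repeatedly pick the tallest point of the residual and subtract a peak there.

    The fixed-width Lorentzian subtraction is an assumed stand-in for a fit.
    """
    residual = list(spectrum)
    picked = []
    for _ in range(max_peaks):
        top = max(range(len(residual)), key=residual.__getitem__)
        height = residual[top]
        if height < threshold:
            break  # nothing left that rises sufficiently above the background
        picked.append(top)
        for i in range(len(residual)):
            residual[i] -= lorentzian(i, top, height, hwhm)
    return sorted(picked)

# Synthetic noiseless spectrum with two lines, for demonstration only.
spectrum = [
    lorentzian(i, 100, 1.0, 2.0) + lorentzian(i, 300, 0.5, 2.0)
    for i in range(400)
]
print(iterative_peak_pick(spectrum, threshold=0.05, hwhm=2.0))
```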
Assignment Logic
External and internal scaling
The paper emphasises that internal DP4-style scaling cannot be used at the start because the assignments are not known yet. The code therefore performs an initial pass with fixed empirical external scaling factors and then recalculates the scaling relation from the provisional assignment. This two-stage logic is present in both the proton and carbon assignment modules.
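The two-stage logic can be sketched as follows. This is an illustrative outline under stated assumptions, not the dp5 code: the external slope and intercept are passed in as placeholders, and a nearest-peak match stands in for the real provisional assignment. The internal relation is obtained by regressing calculated shifts against the provisionally assigned experimental shifts, as in DP4-style scaling.

```python
def linear_fit(x, y):
    """Ordinary least-squares slope and intercept of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def scale(calc, slope, intercept):
    """Map calculated shifts onto the experimental scale."""
    return [(c - intercept) / slope for c in calc]

def nearest_assignment(scaled, peaks):
    """Stand-in provisional assignment: each shift takes its nearest peak."""
    return [min(peaks, key=lambda p: abs(p - s)) for s in scaled]

def two_stage_scaling(calc, peaks, ext_slope, ext_intercept):
    """External scaling first, then internal scaling from the provisional assignment."""
    first_pass = scale(calc, ext_slope, ext_intercept)
    provisional = nearest_assignment(first_pass, peaks)
    slope, intercept = linear_fit(provisional, calc)
    return scale(calc, slope, intercept), provisional

# Hypothetical data: calculated shifts with a systematic linear error.
peaks = [1.0, 2.0, 4.0, 7.0]
calc = [1.05 * p + 0.2 for p in peaks]
rescaled, provisional = two_stage_scaling(calc, peaks, ext_slope=1.0, ext_intercept=0.0)
print(provisional, rescaled)
```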
Proton-specific constraints
The proton assignment implementation uses integral information explicitly. Expanded multiplet centres allow a peak to be assigned as many times as its rounded integral permits. Methyl protons are handled first as grouped units, which reflects the paper’s observation that methyl groups behave as equivalent signals and should be assigned to the same peak before the remaining protons are optimised.
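A small sketch shows how expanded centres and methyl-first handling can interact. It is a greedy illustration rather than the optimisation used in dp5, and the function names and data layout (a dict of calculated shifts, tuples of methyl atom labels) are hypothetical.

```python
from collections import Counter

def expand_centres(centres, integrals):
    """Repeat each multiplet centre once per proton in its rounded integral."""
    expanded = []
    for centre, integral in zip(centres, integrals):
        expanded.extend([centre] * integral)
    return expanded

def assign_protons(calc_shifts, methyl_groups, centres, integrals):
    """Greedy stand-in: methyl groups first as units, then the remaining protons."""
    free = Counter(expand_centres(centres, integrals))  # free slots per centre
    assignment = {}
    for group in methyl_groups:
        mean_shift = sum(calc_shifts[a] for a in group) / len(group)
        candidates = [c for c, n in free.items() if n >= len(group)]
        if not candidates:
            continue  # no multiplet can hold the whole group
        best = min(candidates, key=lambda c: abs(c - mean_shift))
        free[best] -= len(group)
        for atom in group:
            assignment[atom] = best
    for atom, shift in calc_shifts.items():
        if atom in assignment:
            continue
        candidates = [c for c, n in free.items() if n > 0]
        if not candidates:
            break  # every slot is used
        best = min(candidates, key=lambda c: abs(c - shift))
        free[best] -= 1
        assignment[atom] = best
    return assignment

# Hypothetical molecule: one methyl group, two CH2 protons, one CH proton.
calc_shifts = {"H1": 0.95, "H2": 0.95, "H3": 0.95, "H4": 1.4, "H5": 1.6, "H6": 3.5}
result = assign_protons(calc_shifts, [("H1", "H2", "H3")], [0.9, 1.5, 3.6], [3, 2, 1])
print(result)
```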
Carbon-specific weighting
The carbon assignment implementation follows the DP4-AI strategy of using peak amplitudes to distinguish more reliable signal peaks from likely noise. A KDE of peak heights is used to derive amplitude groups and weights. The code then duplicates assignment columns so that one experimental peak may explain multiple equivalent carbons, while progressively penalising repeated use of the same peak. After two assignment rounds, a bias-driven reassignment step checks for cases where a nearby intense unassigned peak is chemically more plausible than a weak peak chosen in the initial optimisation.
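The amplitude-grouping idea can be sketched with a plain Gaussian KDE over normalised peak heights, splitting at the first valley in the density. This only illustrates the grouping: the bandwidth, the grid, the two-group split, and all names are assumptions, and the weights derived from the groups in dp5 are not reproduced here.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a Gaussian kernel density estimate as a callable."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

def amplitude_boundary(heights, bandwidth=0.1, n_grid=201):
    """Height at the first local minimum of the KDE, or None if there is none."""
    top = max(heights)
    density = gaussian_kde([h / top for h in heights], bandwidth)
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    values = [density(x) for x in grid]
    for i in range(1, n_grid - 1):
        if values[i] < values[i - 1] and values[i] <= values[i + 1]:
            return grid[i] * top
    return None

def split_by_amplitude(heights, bandwidth=0.1):
    """Split peak heights into (weak, strong) groups at the KDE valley."""
    boundary = amplitude_boundary(heights, bandwidth)
    if boundary is None:
        return [], list(heights)
    weak = [h for h in heights if h < boundary]
    strong = [h for h in heights if h >= boundary]
    return weak, strong

# Hypothetical peak heights: four noise-like peaks and four signal-like peaks.
heights = [0.02, 0.03, 0.025, 0.04, 0.8, 0.9, 1.0, 0.85]
weak, strong = split_by_amplitude(heights)
print(weak, strong)
```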
Why the Proton Pipeline Is More Elaborate
The supplementary methods make clear that automated proton processing is harder than automated carbon processing because the final data product is not just a list of peak positions. The pipeline must also recover multiplet boundaries, remove over-picked noise peaks, estimate integer-like integrals, and preserve equivalence information needed by the proton assignment algorithm.
That distinction is visible in the code:
proton processing has dedicated modules for gradient peak picking, BIC-based deconvolution, analytic integration, and impurity filtering,
carbon processing delegates more of the complexity to the assignment stage, where amplitude weighting and repeat assignments compensate for noisier peak lists.
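Estimating integer-like integrals can be illustrated with a toy scoring function. The score below (one minus twice the mean distance to the nearest integer) and the names are assumptions for illustration; the score used in dp5 may differ.

```python
def integer_likeness(values):
    """1.0 when every value is an integer, 0.0 when every value is a half-integer."""
    return 1.0 - 2.0 * sum(abs(v - round(v)) for v in values) / len(values)

def scale_to_total(areas, total):
    """Rescale raw region areas so that they sum to a candidate proton count."""
    area_sum = sum(areas)
    return [a * total / area_sum for a in areas]

def best_proton_count(areas, candidates):
    """Candidate total whose rescaled integrals look most like integers."""
    return max(candidates, key=lambda n: integer_likeness(scale_to_total(areas, n)))

# Hypothetical raw areas in a 3:2:1:6 ratio.
areas = [1.5, 1.0, 0.5, 3.0]
total = best_proton_count(areas, range(6, 15))
print(total, scale_to_total(areas, total))
```

In practice the candidate range would be bounded by the proton counts compatible with the proposed structures, since any integer multiple of a good total scores equally well.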
Relationship to Legacy Description Files
The package still supports hand-written NMR description files because DP4-style
workflows pre-date automated raw-data processing. When a description file is
used, the code bypasses the FID pipelines and instead parses manually supplied
shifts, equivalence groups, and omitted atoms. This is why
dp5.nmr_processing.description_files remains part of the public package:
it preserves compatibility with curated inputs while the automated pipelines are
used whenever raw spectra are available.
Practical Reading Guide
For code readers, the most useful entry points are:
dp5.nmr_processing.nmr_ai.NMRData for orchestration,
dp5.nmr_processing.proton.process.proton_processing() for the proton pipeline,
dp5.nmr_processing.carbon.process.carbon_processing() for the carbon pipeline,
dp5.nmr_processing.proton.assign.iterative_assignment() for integral-constrained proton assignment,
dp5.nmr_processing.carbon.assign.iterative_assignment() for amplitude-aware carbon assignment.
The API reference page links these functions directly to their source docstrings.