NVIDIA open-sources the synthetic data framework used to build Nemotron datasets

NVIDIA just open-sourced NeMo Data Designer, the synthetic data framework used internally to build both pre-training and post-training datasets for Nemotron.

It lets you define an entire synthetic data pipeline directly in Python: structured outputs, statistical samplers, LLM-generated columns, dependency-aware field relationships, Python/SQL/remote validators, and optional LLM-as-judge scoring. It also supports a quick preview mode for fast iteration before scaling up.

Install:

```
pip install data-designer
```

A minimal example:

```
from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    LLMTextColumnConfig,
    SamplerColumnConfig,
    SamplerType,
)

data_designer = DataDesigner()
config = DataDesignerConfigBuilder()

# Sampler column: draws a category from a fixed list of values.
config.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"]
        ),
    )
)

# LLM column: the prompt references the sampled column via {{ product_category }}.
config.add_column(
    LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="Write a short product review for a {{ product_category }} item.",
    )
)

# Preview mode: generate a small sample and inspect one record.
preview = data_designer.preview(config_builder=config)
preview.display_sample_record()
```

This release also incorporates the synthetic data tech my team originally built at Gretel (now part of NVIDIA), now generally available for anyone to use or extend.

Repo: https://github.com/NVIDIA-NeMo/DataDesigner
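
For anyone wondering what "dependency-aware field relationships" means in practice, here is a rough conceptual sketch. It is not Data Designer's implementation and uses only the Python standard library; the helper names (`referenced_columns`, `generation_order`, `render`) are hypothetical. The idea it illustrates: a prompt that mentions `{{ product_category }}` depends on that column, so the sampler column has to be generated before the LLM column.

```python
import re
from graphlib import TopologicalSorter

# Matches Jinja-style placeholders such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def referenced_columns(template: str) -> set[str]:
    """Return the column names that a prompt template refers to."""
    return set(PLACEHOLDER.findall(template))


def generation_order(templates: dict[str, str]) -> list[str]:
    """Order columns so each comes after every column its template references.

    Columns with an empty template (for example, samplers) have no dependencies.
    """
    graph = {name: referenced_columns(t) for name, t in templates.items()}
    return list(TopologicalSorter(graph).static_order())


def render(template: str, record: dict[str, str]) -> str:
    """Fill each placeholder with the value already generated for that column."""
    return PLACEHOLDER.sub(lambda m: record[m.group(1)], template)


if __name__ == "__main__":
    columns = {
        "review": "Write a short product review for a {{ product_category }} item.",
        "product_category": "",  # sampler column, no template
    }
    print(generation_order(columns))  # ['product_category', 'review']
    print(render(columns["review"], {"product_category": "Books"}))
    # Write a short product review for a Books item.
```

A circular reference between two templates would raise `graphlib.CycleError` in this sketch, which is the kind of mistake a dependency-aware pipeline can catch before any LLM calls are made.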