Synthetic Tabular Data Generation for Class Imbalance and Fairness: A Comparative Study
Due to their data-driven nature, Machine Learning (ML) models are susceptible to bias inherited from data, especially in classification problems where class and group imbalances are prevalent. Class imbalance (in the classification target) and group imbalance (in protected attributes like sex or race) can undermine both ML utility and fairness. Although class and group imbalances commonly coincide in real-world tabular datasets, limited methods address this scenario. While most methods use oversampling techniques, like interpolation, to mitigate imbalances, recent advancements in synthetic tabular data generation offer promise but have not been adequately explored for this purpose. To this end, this paper conducts a comparative analysis to address class and group imbalances using state-of-the-art models for synthetic tabular data generation and various sampling strategies. Experimental results on four datasets, demonstrate the effectiveness of generative models for bias mitigation, creating opportunities for further exploration in this direction.
Generative models for synthetic data generation: application to pharmacokinetic/pharmacodynamic data
The generation of synthetic patient data that reflect the statistical properties of real data plays a fundamental role in today’s world because of its potential to (i) be enable proprietary data access for statistical and research purposes and (ii) increase available data (e.g., in low-density regions—i.e., for patients with under-represented characteristics). Generative methods employ a family of solutions for generating synthetic data. The objective of this research is to benchmark numerous state-of-the-art deep-learning generative methods across different scenarios and clinical datasets comprising patient covariates and several pharmacokinetic/pharmacodynamic endpoints. We did this by implementing various probabilistic models aimed at generating synthetic data, such as the Multi-layer Perceptron Conditioning Generative Adversarial Neural Network (MLP cGAN), Time-series Generative Adversarial Networks (TimeGAN), and a more traditional approach like Probabilistic Autoregressive (PAR). We evaluated their performance by calculating discriminative and predictive scores. Furthermore, we conducted comparisons between the distributions of real and synthetic data using Kolmogorov-Smirnov and Chi-square statistical tests, focusing respectively on covariate and output variables of the models. Lastly, we employed pharmacometrics-related metric to enhance interpretation of our results specific to our investigated scenarios. Results indicate that multi-layer perceptron–based conditional generative adversarial networks (MLP cGAN) exhibit the best overall performance for most of the considered metrics. This work highlights the opportunities to employ synthetic data generation in the field of clinical pharmacology for augmentation and sharing of proprietary data across institutions.
Synthetic Data Applications in Finance
- 2023/12/29
- Review
- Vamsi K. Potluru,
- Daniel Borrajo,
- Andrea Coletta,
- Niccolò Dalmasso,
- Yousef El-Laham,
- Elizabeth Fons,
- Mohsen Ghassemi,
- Sriram Gopalakrishnan,
- Vikesh Gosai,
- Eleonora Kreačić,
- Ganapathy Mani,
- Saheed Obitayo,
- Deepak Paramanand,
- Natraj Raman,
- Mikhail Solonin,
- Srijan Sood,
- Svitlana Vyetrenko,
- Haibei Zhu,
- Manuela Veloso,
- Tucker Balch
Synthetic data has made tremendous strides in various commercial settings including finance, healthcare, and virtual reality. We present a broad overview of prototypical applications of synthetic data in the financial sector and in particular provide richer details for a few select ones. These cover a wide variety of data modalities including tabular, time-series, event-series, and unstructured arising from both markets and retail financial applications. Since finance is a highly regulated industry, synthetic data is a potential approach for dealing with issues related to privacy, fairness, and explainability. Various metrics are utilized in evaluating the quality and effectiveness of our approaches in these applications. We conclude with open directions in synthetic data in the context of the financial domain.
Time-series Generation by Contrastive Imitation
Consider learning a generative model for time-series data. The sequential setting poses a unique challenge: Not only should the generator capture the conditional dynamics of (stepwise) transitions, but its open-loop rollouts should also preserve the joint distribution of (multi-step) trajectories. On one hand, autoregressive models trained by MLE allow learning and computing explicit transition distributions, but suffer from compounding error during rollouts. On the other hand, adversarial models based on GAN training alleviate such exposure bias, but transitions are implicit and hard to assess. In this work, we study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy, where the reinforcement signal is provided by a global (but stepwise-decomposable) energy model trained by contrastive estimation. At training, the two components are learned cooperatively, avoiding the instabilities typical of adversarial objectives. At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality. By expressly training a policy to imitate sequential behavior of time-series features in a dataset, this approach embodies "generation by imitation". Theoretically, we illustrate the correctness of this formulation and the consistency of the algorithm. Empirically, we evaluate its ability to generate predictively useful samples from real-world datasets, verifying that it performs at the standard of existing benchmarks.
Conditional Sig-Wasserstein GANs for Time Series Generation
Generative adversarial networks (GANs) have been extremely successful in generating samples, from seemingly high dimensional probability measures. However, these methods struggle to capture the temporal dependence of joint probability distributions induced by time-series data. Furthermore, long time-series data streams hugely increase the dimension of the target space, which may render generative modelling infeasible. To overcome these challenges, motivated by the autoregressive models in econometric, we are interested in the conditional distribution of future time series given the past information. We propose the generic conditional Sig-WGAN framework by integrating Wasserstein-GANs (WGANs) with mathematically principled and efficient path feature extraction called the signature of a path. The signature of a path is a graded sequence of statistics that provides a universal description for a stream of data, and its expected value characterises the law of the time-series model. In particular, we develop the conditional Sig-W1 metric, that captures the conditional joint law of time series models, and use it as a discriminator. The signature feature space enables the explicit representation of the proposed discriminators which alleviates the need for expensive training. We validate our method on both synthetic and empirical dataset and observe that our method consistently and significantly outperforms state-of-the-art benchmarks with respect to measures of similarity and predictive ability.
Deep Learning for Time Series Forecasting: Advances and Open Problems
A time series is a sequence of time-ordered data, and it is generally used to describe how a phenomenon evolves over time. Time series forecasting, estimating future values of time series, allows the implementation of decision-making strategies. Deep learning, the currently leading field of machine learning, applied to time series forecasting can cope with complex and high-dimensional time series that cannot be usually handled by other machine learning techniques. The aim of the work is to provide a review of state-of-the-art deep learning architectures for time series forecasting, underline recent advances and open problems, and also pay attention to benchmark data sets. Moreover, the work presents a clear distinction between deep learning architectures that are suitable for short-term and long-term forecasting. With respect to existing literature, the major advantage of the work consists in describing the most recent architectures for time series forecasting, such as Graph Neural Networks, Deep Gaussian Processes, Generative Adversarial Networks, Diffusion Models, and Transformers.
Fully Embedded Time-Series Generative Adversarial Networks
Generative Adversarial Networks (GANs) should produce synthetic data that fits the underlying distribution of the data being modeled. For real valued time-series data, this implies the need to simultaneously capture the static distribution of the data, but also the full temporal distribution of the data for any potential time horizon. This temporal element produces a more complex problem that can potentially leave current solutions under-constrained, unstable during training, or prone to varying degrees of mode collapse. In FETSGAN, entire sequences are translated directly to the generator’s sampling space using a seq2seq style adversarial autoencoder (AAE), where adversarial training is used to match the training distribution in both the feature space and the lower dimensional sampling space. This additional constraint provides a loose assurance that the temporal distribution of the synthetic samples will not collapse. In addition, the First Above Threshold (FAT) operator is introduced to supplement the reconstruction of encoded sequences, which improves training stability and the overall quality of the synthetic data being generated. These novel contributions demonstrate a significant improvement to the current state of the art for adversarial learners in qualitative measures of temporal similarity and quantitative predictive ability of data generated through FETSGAN.
Limit Order Book Simulation with Generative Adversarial Networks
We propose a nonparametric method for simulating the dynamics of a limit order book using Generative Adversarial Networks (GAN) to learn the conditional distribution of the future state of the order book given its current state from time series of the limit order book. Our method yields a scenario generator for limit order books which captures a range of stylized facts and salient properties of limit order book transitions. We show that the trained generator is also able to correctly reproduce some key properties observed in empirical studies on market impact. In particular, the model exhibits a decaying marginal impact of trade size, higher impact of aggressive orders, as well as a decreasing relation between impact and order book depth.
Non-adversarial training of Neural SDEs with signature kernel scores
Neural SDEs are continuous-time generative models for sequential data. Stateof-the-art performance for irregular time series generation has been previously obtained by training these models adversarially as GANs. However, as typical for GAN architectures, training is notoriously unstable, often suffers from mode collapse, and requires specialised techniques such as weight clipping and gradient penalty to mitigate these issues. In this paper, we introduce a novel class of scoring rules on pathspace based on signature kernels and use them as objective for training Neural SDEs non-adversarially. By showing strict properness of such kernel scores and consistency of the corresponding estimators, we provide existence and uniqueness guarantees for the minimiser. With this formulation, evaluating the generator-discriminator pair amounts to solving a system of linear path-dependent PDEs which allows for memory-efficient adjoint-based backpropagation. Moreover, because the proposed kernel scores are well-defined for paths with values in infinite dimensional spaces of functions, our framework can be easily extended to generate spatiotemporal data. Our procedure permits conditioning on a rich variety of market conditions and significantly outperforms alternative ways of training Neural SDEs on a variety of tasks including the simulation of rough volatility models, the conditional probabilistic forecasts of real-world forex pairs where the conditioning variable is an observed past trajectory, and the mesh-free generation of limit order book dynamics.
Multi-dimensional Geometric Brownian motion
NASDAQ public exchange
Rough Volatility model
PCF-GAN: generating sequential data via the characteristic function of measures on the path space
Generating high-fidelity time series data using generative adversarial networks
(GANs) remains a challenging task, as it is difficult to capture the temporal dependence of joint probability distributions induced by time-series data. Towards this goal, a key step is the development of an effective discriminator to distinguish
between time series distributions. We propose the so-called PCF-GAN, a novel
GAN that incorporates the path characteristic function (PCF) as the principled
representation of time series distribution into the discriminator to enhance its generative performance. On the one hand, we establish theoretical foundations of the
PCF distance by proving its characteristicity, boundedness, differentiability with
respect to generator parameters, and weak continuity, which ensure the stability
and feasibility of training the PCF-GAN. On the other hand, we design efficient
initialisation and optimisation schemes for PCFs to strengthen the discriminative power and accelerate training efficiency. To further boost the capabilities
of complex time series generation, we integrate the auto-encoder structure via
sequential embedding into the PCF-GAN, which provides additional reconstruction functionality. Extensive numerical experiments on various datasets demonstrate the consistently superior performance of PCF-GAN over state-of-the-art
baselines, in both generation and reconstruction quality.
OU Process
Rough Volatility model
Yahoo stock data
UCI Energy and Air data
EEG
Discriminative score
Predictive score
Sig-MMD