Sampling for Scientific Data Analysis and Reduction
With exascale supercomputers on the horizon, data-driven in situ data reduction is an important topic: it can enable post hoc data visualization, reconstruction, and exploration with minimal information loss. Sophisticated sampling methods provide a fast approximation to the data that can be used as a preview of the simulation output without the need for full data reconstruction. More detailed analysis can then be performed by reconstructing the sampled data set as necessary. Other data reduction methods, such as compression techniques, can still be applied to the sampled outputs to achieve further data reduction. Sampling can be performed in the spatial domain (which data locations are to be stored?) and/or the temporal domain (which time steps are to be stored?). Given a spatial location, data-driven sampling approaches take into account its local properties (such as scalar value and local smoothness) and the multivariate association among scalar values to determine the importance of that location. For temporal sampling, changes in the local and global properties across time steps are taken into account as importance criteria. In this chapter, spatial sampling approaches are discussed for univariate and multivariate data sets, and their use for effective in situ data reduction is demonstrated.
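To make the idea of a value-driven importance criterion concrete, the following is a minimal sketch and not the chapter's implementation. It assumes that a location is more important when its scalar value is rare, and it spreads a storage budget evenly across histogram bins so that rare values are kept almost entirely while common values are thinned. The function names (`bin_index`, `acceptance_probabilities`, `sample_indices`), the equal-width binning, and the default of 16 bins are all illustrative assumptions.

```python
import random


def bin_index(v, lo, hi, bins):
    """Map a value v in [lo, hi] to an equal-width bin index in [0, bins)."""
    if hi == lo:
        return 0
    return min(int((v - lo) / (hi - lo) * bins), bins - 1)


def acceptance_probabilities(values, fraction, bins=16):
    """Per-bin probability of keeping a point (assumed criterion: rarity).

    The budget fraction * len(values) is shared evenly among the non-empty
    bins, visiting the rarest bins first. A bin that cannot use its whole
    share passes the unused part on to the more populated bins.
    """
    lo, hi = min(values), max(values)
    counts = [0] * bins
    for v in values:
        counts[bin_index(v, lo, hi, bins)] += 1

    budget = fraction * len(values)
    probs = [0.0] * bins
    order = sorted((b for b in range(bins) if counts[b] > 0),
                   key=lambda b: counts[b])
    remaining = len(order)
    for b in order:
        quota = budget / remaining      # even share of what is left
        take = min(counts[b], quota)    # cannot keep more than the bin holds
        probs[b] = take / counts[b]
        budget -= take
        remaining -= 1
    return probs, lo, hi


def sample_indices(values, fraction, bins=16, seed=0):
    """Return the indices of the points selected for storage."""
    rng = random.Random(seed)
    probs, lo, hi = acceptance_probabilities(values, fraction, bins)
    return [i for i, v in enumerate(values)
            if rng.random() < probs[bin_index(v, lo, hi, bins)]]
```

Calling `sample_indices(field, 0.05)` on a flattened scalar field would keep roughly 5% of the locations, biased toward uncommon values. A smoothness-based or multivariate criterion would replace the histogram step with a different importance measure, while the accept/reject step could stay the same.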
Acknowledgements
We would like to thank our Data Science at Scale Team colleagues: D. H. Rogers, L.-T. Lo, J. Patchett, our colleague from the Statistical Group CCS-6: Earl Lawrence, our industry partners at Kitware and other collaborators: C. Harrison, M. Larsen. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, which is supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. 89233218CNA000001. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. The Hurricane Isabel data set has kindly been provided by Wei Wang, Cindy Bruyere, Bill Kuo, and others at NCAR. Tim Scheitlin at NCAR converted the data into the Brick-of-Float format. The Turbulent Combustion data set is made available by Dr. Jacqueline Chen at Sandia National Laboratories through US Department of Energy’s SciDAC Institute for Ultrascale Visualization. This research was released under LA-UR-20-21090.
Author information
Authors and Affiliations
- Los Alamos National Lab, Los Alamos, NM, USA: Ayan Biswas, Soumya Dutta, Terece L. Turton & James Ahrens