Sampling for Scientific Data Analysis and Reduction

With exascale supercomputers on the horizon, data-driven in situ data reduction has become an important topic, since it can enable post hoc visualization, reconstruction, and exploration of the data with minimal information loss. Sophisticated sampling methods provide a fast approximation of the data that can serve as a preview of the simulation output without full data reconstruction. More detailed analysis can then be performed by reconstructing the sampled data set as necessary. Other data reduction methods, such as compression, can still be applied to the sampled output to reduce the data further. Sampling can be performed in the spatial domain (which data locations are to be stored?), the temporal domain (which time steps are to be stored?), or both. Given a spatial location, data-driven sampling approaches take into account its local properties (such as scalar value and local smoothness) and the multivariate association among scalar values to determine the importance of that location. For temporal sampling, changes in the local and global properties across time steps serve as the importance criteria. In this chapter, spatial sampling approaches are discussed for univariate and multivariate data sets, and their use for effective in situ data reduction is demonstrated.
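
To make the idea of importance-driven spatial sampling concrete, the following is a minimal sketch and not the chapter's actual method. It assumes a small 2D scalar field stored as nested lists, uses gradient magnitude as a stand-in for "local smoothness", and selects locations with a weighted random draw. The function names `gradient_importance` and `sample_locations`, the `floor` parameter, and the choice of weighted-sampling scheme are all illustrative assumptions.

```python
import math
import random


def gradient_importance(field):
    """Score each cell by its central-difference gradient magnitude.

    Assumption: a large gradient marks a non-smooth, hence important, region.
    """
    ny, nx = len(field), len(field[0])
    importance = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            # Neighbor indices are clamped at the boundary.
            dx = field[j][min(i + 1, nx - 1)] - field[j][max(i - 1, 0)]
            dy = field[min(j + 1, ny - 1)][i] - field[max(j - 1, 0)][i]
            importance[j][i] = math.hypot(dx, dy)
    return importance


def sample_locations(importance, fraction, seed=0, floor=1e-3):
    """Keep about `fraction` of the cells, favoring high-importance ones.

    Weighted sampling without replacement: every cell draws an exponential
    "arrival time" with rate equal to its weight, and the k earliest arrivals
    are kept. `floor` (hypothetical) gives smooth regions a small chance.
    """
    rng = random.Random(seed)
    ny, nx = len(importance), len(importance[0])
    keyed = []
    for j in range(ny):
        for i in range(nx):
            weight = importance[j][i] + floor
            keyed.append((rng.expovariate(1.0) / weight, (j, i)))
    keyed.sort()
    k = max(1, round(fraction * ny * nx))
    return [cell for _, cell in keyed[:k]]


# Toy field: a sharp vertical step between columns 7 and 8.
field = [[0.0 if i < 8 else 1.0 for i in range(16)] for _ in range(16)]
picked = sample_locations(gradient_importance(field), fraction=0.1)
print(len(picked), "of", 16 * 16, "cells kept")
```

In this toy field, most of the retained cells should lie along the step, while the flat regions are sampled only sparsely. A real in situ pipeline would stream the data block by block rather than hold it as nested lists.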

Acknowledgements

We would like to thank our Data Science at Scale Team colleagues: D. H. Rogers, L.-T. Lo, J. Patchett, our colleague from the Statistical Group CCS-6: Earl Lawrence, our industry partners at Kitware and other collaborators: C. Harrison, M. Larsen. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, which is supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. 89233218CNA000001. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. The Hurricane Isabel data set has kindly been provided by Wei Wang, Cindy Bruyere, Bill Kuo, and others at NCAR. Tim Scheitlin at NCAR converted the data into the Brick-of-Float format. The Turbulent Combustion data set is made available by Dr. Jacqueline Chen at Sandia National Laboratories through US Department of Energy’s SciDAC Institute for Ultrascale Visualization. This research was released under LA-UR-20-21090.

Author information

Authors and Affiliations

Los Alamos National Lab, Los Alamos, NM, USA: Ayan Biswas, Soumya Dutta, Terece L. Turton & James Ahrens
