Sampling for Scientific Data Analysis and Reduction

With exascale supercomputers on the horizon, data-driven in situ data reduction is an important topic because it can enable post hoc data visualization, reconstruction, and exploration with minimal information loss. Sophisticated sampling methods provide a fast approximation of the data that can serve as a preview of the simulation output without requiring full data reconstruction. More detailed analysis can then be performed by reconstructing the sampled data set as necessary. Other data reduction methods, such as compression techniques, can still be applied to the sampled outputs to achieve further reduction. Sampling can be performed in the spatial domain (which data locations are to be stored?), the temporal domain (which time steps are to be stored?), or both. Given a spatial location, data-driven sampling approaches determine its importance from its local properties (such as scalar value and local smoothness) and from the multivariate association among scalar values. For temporal sampling, changes in the local and global properties across time steps serve as importance criteria. This chapter discusses spatial sampling approaches for univariate and multivariate data sets and demonstrates their use for effective in situ data reduction.
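To make the idea of importance-based spatial sampling concrete, the following is a minimal sketch in Python. It is not the chapter's method: it assumes a single illustrative criterion, value rarity, in which a location's importance is the inverse of the population of its histogram bin. The function name `rarity_weighted_sample` and the parameters `fraction`, `num_bins`, and `seed` are hypothetical, and the weighted selection uses Efraimidis-Spirakis random keys as one convenient way to sample without replacement.

```python
import random
from collections import Counter


def rarity_weighted_sample(values, fraction, num_bins=8, seed=0):
    """Return sorted indices of the locations kept by a value-rarity sampler.

    Assumption for illustration: importance = 1 / (population of the
    location's histogram bin), so every occupied bin carries the same total
    weight and locations with rare scalar values are favored.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 for constant data

    # Assign each location to a histogram bin of its scalar value.
    bins = [min(int((v - lo) / width), num_bins - 1) for v in values]
    counts = Counter(bins)

    # Weighted sampling without replacement via random keys:
    # key = u ** (1 / w) with w = 1 / count, i.e. key = u ** count.
    rng = random.Random(seed)
    keys = [(rng.random() ** counts[b], i) for i, b in enumerate(bins)]

    # Keep the k locations with the largest keys.
    k = max(1, round(fraction * len(values)))
    keep = sorted(keys, reverse=True)[:k]
    return sorted(i for _, i in keep)


# Hypothetical 1D field: a large uniform region plus a small rare feature.
values = [0.0] * 95 + [10.0] * 5
kept = rarity_weighted_sample(values, fraction=0.1)
print(len(kept), "of", len(values), "locations kept")
```

Because both occupied bins receive equal total weight, the five rare locations compete on equal footing with the ninety-five common ones, which is the behavior a purely random sampler lacks. Other criteria named in the abstract, such as local smoothness or multivariate association, could replace the weight computation without changing the selection step.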

Acknowledgements

We would like to thank our Data Science at Scale Team colleagues D. H. Rogers, L.-T. Lo, and J. Patchett; our colleague from the Statistical Group CCS-6, Earl Lawrence; our industry partners at Kitware; and our other collaborators C. Harrison and M. Larsen. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, which is supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. 89233218CNA000001. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. The Hurricane Isabel data set has kindly been provided by Wei Wang, Cindy Bruyere, Bill Kuo, and others at NCAR. Tim Scheitlin at NCAR converted the data into the Brick-of-Float format. The Turbulent Combustion data set is made available by Dr. Jacqueline Chen at Sandia National Laboratories through the U.S. Department of Energy's SciDAC Institute for Ultrascale Visualization. This research was released under LA-UR-20-21090.

Author information

Authors and Affiliations

  1. Los Alamos National Lab, Los Alamos, NM, USA: Ayan Biswas, Soumya Dutta, Terece L. Turton & James Ahrens