Best Practices in Software Engineering for Scientific Computing


Dr. Madicken Munk

National Center for Supercomputing Applications

2021.09.24

About Me:
Oregon State University
Nuclear Engineering, University of California, Berkeley
Radiation Transport Group, Oak Ridge National Laboratory
BIDS
Illinois NCSA
DXL
About Me:
The Carpentries git-novice Disc committee
scipy school yt project
open source directions

yt and widgyts


Project website
Project repository

My slides and talk from SciPy 2018

Science

  • builds and organizes knowledge
  • tests explanations about the universe
  • systematically,
  • objectively,
  • transparently,
  • and reproducibly.

Otherwise it's not science.

Computers

should...

  • improve efficiency,
  • reduce human error,
  • automate the mundane,
  • simplify the complex,
  • and accelerate research.

But scientists don't use them effectively.

“ Computational science is a special case of scientific research: the work is easily shared via the Internet since the paper, code, and data are digital and those three aspects are all that is required to reproduce the results, given sufficient computation tools. ” - Stodden, 2010.

Getting Started

“ Organized Skepticism. Scientists are critical: All ideas must be tested and are subject to rigorous structured community scrutiny.” - R.K. Merton, 1942

Data Storage

  • Good: pencil and paper
  • Better: spreadsheet
  • Best: standardized file format, database management system


Formats: Evaluated Nuclear Data File (ENDF), GRIdded Binary (GRIB), Flexible Image Transport System (FITS), Hierarchical Data Format (HDF), etc.

Management: C/Python/Fortran APIs, HDF5, SQL, MySQL, sqlite, MongoDB, Hadoop, etc.
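
As a minimal sketch of the "Best" option, the snippet below stores a few measurements in a SQLite database through Python's built-in sqlite3 module. The table name, column names, and values are hypothetical and exist only for illustration; an in-memory database is used so nothing is written to disk.

```python
import sqlite3


def store_and_summarize(rows):
    """Store (sample, energy_mev, counts) rows and return total counts per sample."""
    # ":memory:" keeps the database in RAM; pass a filename instead to persist it.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE measurements (sample TEXT, energy_mev REAL, counts INTEGER)"
        )
        # Parameterized queries avoid building SQL strings by hand.
        conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)
        conn.commit()
        cursor = conn.execute(
            "SELECT sample, SUM(counts) FROM measurements "
            "GROUP BY sample ORDER BY sample"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical data, for illustration only.
example_rows = [("A", 0.5, 120), ("A", 1.0, 80), ("B", 0.5, 45)]
totals = store_and_summarize(example_rows)
print(totals)
```

The point of the design is that the schema documents what each value means, and queries replace hand-written loops over ad hoc files.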

Backing Up Files

  • Good: hope
  • Better: nightly emails
  • Best: remote version control (GitHub, BitBucket, Launchpad, GitLab)


Version Control Systems: cvs, svn, hg, git

Hint: tools like git-annex or git-lfs can help you manage large data files

Managing Changes

  • Good: naming convention
  • Better: clever naming convention
  • Best: local version control

Version Control.

Getting It Done

“ It takes just as much time to write a good paper as it takes to write a bad one. ” - Polterovich, 2014

Analysis

  • Good: pencil and calculator
  • Better: spreadsheets, MATLAB, Mathematica
  • Best: scripting, open source libraries, modern programming language


Hint: Check out GitHub for existing analysis toolkits in your domain, e.g. PyNE or serpentTools.

Multiple File Cleanup

  • Good: manually edit every file
  • Better: search and replace in each file
  • Best: scripted batch editing


Hint: try a tutorial on BASH, CSH, Python, or Perl, e.g. the bash lesson by Software Carpentry.
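
A rough sketch of scripted batch editing in Python, assuming the files are small text files that fit in memory. The function name batch_replace, the file names, and the file contents are invented for this example; the demonstration runs in a temporary directory so no real files are touched.

```python
import re
import tempfile
from pathlib import Path


def batch_replace(directory, pattern, replacement, glob="*.txt"):
    """Apply a regex substitution to every matching file; return changed file names."""
    changed = []
    for path in sorted(Path(directory).glob(glob)):
        original = path.read_text()
        updated = re.sub(pattern, replacement, original)
        # Only rewrite files whose content actually changed.
        if updated != original:
            path.write_text(updated)
            changed.append(path.name)
    return changed


# Demonstration on throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "run1.txt").write_text("temp = 300 K\n")
    Path(tmp, "run2.txt").write_text("temp = 300 K\npressure = 1 atm\n")
    Path(tmp, "notes.txt").write_text("nothing to change\n")
    changed_files = batch_replace(tmp, r"temp = 300 K", "temp = 350 K")
    run2_text = Path(tmp, "run2.txt").read_text()

print(changed_files)
```

Running such a script inside a version-controlled directory makes it easy to inspect, and if necessary revert, every change it made.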

Executing Workflows

  • Good: retype a series of commands
  • Better: bash script
  • Best: build system


Build System Tools: make, snakemake, autoconf, automake, cmake, docker, etc.

Reference: The Carpentries have an associated Automation and Make lesson.

Data Structures

  • Good: 100 string variables holding doubles
  • Better: lists of lists of doubles
  • Best: appropriate powerful data structures


Hint: In Fortran, learn about arrays. In C++, learn about maps, vectors, deques, queues, etc. In Python, the power lies in dictionaries and numpy arrays.

Addendum: Perhaps also DataFrames or xarray.
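
To illustrate the difference between many loose variables and one appropriate structure, here is a small sketch using only the Python standard library. The detector names and readings are made up; for large numeric data a numpy array would likely be the more natural choice.

```python
from statistics import mean

# Instead of detector_a_1, detector_a_2, ... as separate variables,
# keep one dictionary mapping each detector name to its list of readings.
readings = {
    "detector_a": [10.0, 12.0, 11.0],
    "detector_b": [20.0, 19.0, 21.0],
}


def mean_by_key(data):
    """Return the mean of each list of values, keyed by the same names."""
    return {name: mean(values) for name, values in data.items()}


averages = mean_by_key(readings)
print(averages)
```

Adding a third detector now means adding one dictionary entry rather than editing every line that handles the data.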

API Design

  • Good: single block of procedural code
  • Better: separate functions
  • Best: small, testable functions, grouped into classes, DRY


DRY: Don't Repeat Yourself. Code replication is bug proliferation.

Hint: github.com/audreyr/cookiecutter or github.com/uwescience/shablona
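
One possible shape for "small, testable functions, grouped into classes" is sketched below. The Spectrum class and its methods are hypothetical names chosen for this example, not part of any library mentioned in these slides.

```python
class Spectrum:
    """A list of counts with a few small, individually testable operations."""

    def __init__(self, counts):
        self.counts = list(counts)

    def total(self):
        """Sum of all counts."""
        return sum(self.counts)

    def normalized(self):
        """Counts divided by their total, so the result sums to 1."""
        total = self.total()  # reuse total() rather than repeating the sum (DRY)
        if total == 0:
            raise ValueError("cannot normalize a spectrum with zero total counts")
        return [count / total for count in self.counts]


spectrum = Spectrum([1, 1, 2])
print(spectrum.total())
print(spectrum.normalized())
```

Because each method does one thing, each can be checked in isolation, and the normalization logic lives in exactly one place.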

Variable Naming

  • Good: d1, d2, d3
  • Better: x, y, z
  • Best: p.x, p.y, p.z, p=Point(x,y,z)

Hint: Prof. Jenny Bryan on Naming Things
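
The "Best" naming above could look like the following in Python. This is only a sketch: the Point class and the distance_to method are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A point in 3D space; field names say what each number means."""

    x: float
    y: float
    z: float

    def distance_to(self, other):
        """Euclidean distance between this point and another."""
        return math.sqrt(
            (self.x - other.x) ** 2
            + (self.y - other.y) ** 2
            + (self.z - other.z) ** 2
        )


p = Point(1.0, 2.0, 2.0)
origin = Point(0.0, 0.0, 0.0)
print(p.x, p.y, p.z)
print(p.distance_to(origin))
```

Compared with d1, d2, d3, a reader of p.x never has to guess which coordinate is meant.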

Style Guides

  • Good: Have style
  • Better: Agree with your colleagues on style
  • Best: Follow a standard style guide (e.g. PEP8)

Hint: C++ Style Guide, Black code formatter for Python

File I/O

  • Good: none, hardcoded variables
  • Better: plain text input file, line-by-line homemade string parsing
  • Best: file parsing library


Tools: python argparse, xml rng, json, etc.
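
A minimal sketch of replacing hardcoded variables and homemade string parsing with library parsers, here argparse for the command line and json for the input file. The option names, the input file name, and the parameters (temperature, material, steps) are hypothetical.

```python
import argparse
import json
import tempfile
from pathlib import Path


def parse_args(argv):
    """Parse command-line arguments for a hypothetical simulation."""
    parser = argparse.ArgumentParser(description="Run a hypothetical simulation.")
    parser.add_argument("input_file", type=Path, help="path to a JSON input file")
    parser.add_argument("--steps", type=int, default=10, help="number of steps")
    return parser.parse_args(argv)


def load_settings(argv):
    """Combine the JSON input file with command-line options."""
    args = parse_args(argv)
    with args.input_file.open() as handle:
        settings = json.load(handle)  # the library handles quoting, numbers, nesting
    settings["steps"] = args.steps
    return settings


# Demonstration with a throwaway input file.
with tempfile.TemporaryDirectory() as tmp:
    input_path = Path(tmp, "input.json")
    input_path.write_text('{"temperature": 300.0, "material": "water"}')
    settings = load_settings([str(input_path), "--steps", "100"])

print(settings)
```

In a real script, passing sys.argv[1:] to load_settings would read the arguments typed by the user.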

Getting It Right

“ The scientific method’s central motivation is the ubiquity of error—the awareness that mistakes and self-delusion can creep in absolutely anywhere and that the scientist’s effort is primarily expended in recognizing and rooting out error. ” - Donoho, 2009.

Error Detection

  • Good: show results to experts
  • Better: integration testing
  • Best: unit test suite, continuous integration

Error Diagnostics

  • Good: re-re-read the code
  • Better: print statements
  • Best: use a linter, a debugger, and a profiler


Tools: cpplint, pyflakes, gdb, lldb, pdb, idb, valgrind, kernprof, kcachegrind, cProfile/snakeviz
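
As one illustration of using a profiler instead of guessing, the sketch below runs Python's built-in cProfile on a deliberately simple function and captures the report as text. The function slow_sum_of_squares is a stand-in invented for this example; timings in the report will differ from machine to machine.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Sum i*i for i in range(n) with an explicit loop (a stand-in for real work)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Collect profiling data only around the call of interest.
profiler = cProfile.Profile()
profiler.enable()
answer = slow_sum_of_squares(10_000)
profiler.disable()

# Write the five most expensive entries (by cumulative time) to a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

For a whole script, running python -m cProfile on it from the command line gives a similar report without modifying the code.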

Error Correction

  • Good: fix code
  • Better: fix, add an exception
  • Best: fix, add an exception, add a test

Hint: katyhuff.github.io/python-testing
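
The "fix, add an exception, add a test" pattern might look like this, sketched with Python's built-in unittest. The function relative_error and the scenario (a division by zero discovered in use) are hypothetical.

```python
import unittest


def relative_error(measured, expected):
    """Return |measured - expected| / |expected|."""
    # The fix: fail loudly with a clear message instead of dividing by zero.
    if expected == 0:
        raise ValueError("expected value must be nonzero")
    return abs(measured - expected) / abs(expected)


class TestRelativeError(unittest.TestCase):
    def test_known_value(self):
        self.assertAlmostEqual(relative_error(11.0, 10.0), 0.1)

    def test_zero_expected_raises(self):
        # The regression test: this bug cannot silently return.
        with self.assertRaises(ValueError):
            relative_error(1.0, 0.0)


# Run the tests programmatically so the example works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRelativeError)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the test in the suite means a continuous integration service will rerun it on every future change.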

Getting It Together

“ Two of the biggest challenges scientists and other programmers face when working with code and data are keeping track of changes (and being able to revert them if things go wrong), and collaborating on a program or dataset. ” - Wilson, et al. 2014.

Merging Collaborative Work

  • Good: single master copy, waiting
  • Better: emails and patches
  • Best: remote version control

Peer Review For Code

  • Good: separation of concerns
  • Better: shared repository
  • Best: peer-reviewed pull requests

“ just-in-time review of small code changes is more likely to succeed than large-scale end-of-work reviews. ” - Petre, Wilson 2014

Teamwork

  • Good: weekly research meetings, year-long tasks
  • Better: daily conversations, month-long goals
  • Best: pair programming, issue tracking

Software Handovers

  • Good: zip file, theory paper
  • Better: comments in code, example input file
  • Best: automated documentation, test suite


Books: Clean Code, Working Effectively with Legacy Code

Tools: sphinx, doxygen, googletest, unittest, nosetests, pytest

Getting It Out There

“ If a piece of scientific software is released in the forest, does it change the field? ”

Plotting

  • Good: custom formatting, clickable GUI
  • Better: plot format templates (Excel, Mathematica)
  • Best: scripted plotting, matplotlib, gnuplot, etc.

Writing

  • Good: stone tablet, Microsoft Word
  • Better: Word with track changes, OpenOffice
  • Best: plain text markup with version control and a makefile


Tools: LaTeX, Markdown, reStructuredText

Distribution Control

  • Good: "email to request access"
  • Better: license file
  • Best: license file, citation file, DOI, forkable repository


Example: github.com/cyclus

Unique Issue in Nuclear Engineering

Export control is a serious concern in nuclear engineering.

Community Adoption

  • Good: none, internal use only
  • Better: online repository, developer email online
  • Best: issue tracker, user/developer listhost(s), communication channels, online documentation

Resources

Ok, I'm convinced. So how can one learn this stuff?

Online Resources

Papers!

Good Books

  • Clean Code - Robert C. Martin
  • Working Effectively with Legacy Code - Michael Feathers
  • Effective Computation in Physics - Huff, Scopatz

Acknowledgements


Many of these slides were originally in a presentation by Dr. Katy Huff at:
katyhuff.github.io/2017-09-20-ncsa
which is licensed under a Creative Commons Attribution 4.0 International License.

Thank You!

Madicken Munk


https://munkm.github.io/2021-09-24-NCSA
This publication is supported in part by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4561 to Matthew Turk.
Best Practices in Software Engineering for Scientific Computing by Madicken Munk is licensed under a Creative Commons Attribution 4.0 International License.
Based on a work at http://munkm.github.io/2021-09-24-NCSA.