NOAO Data Lab Frequently Asked Questions

What is the Data Lab?

The Data Lab is a new NOAO facility under development to allow researchers to work, in an efficient and well-integrated way, with datasets too big and unwieldy for typical desktop computers.

When will the Data Lab be available to use?

The first public release is planned for early 2017. A science demonstration is planned for early 2016 during the AAS meeting.

What will I be able to do with the Data Lab when it is first released?

The initial focus of the Data Lab will be DECam catalogs and images.  In its first release, the Data Lab will allow users to:

  • Explore the contents of large DECam catalogs and their relationship to other large datasets, or drill down to the original pixel data if necessary.
  • Experiment with workflows to perform quantitative analysis on DECam catalog data and associated pixels.  These workflows may include a mix of user-supplied code and core Data Lab tools.
  • Access selected NOAO survey data products through standardized interfaces.
  • Collaborate with other users using a shared workspace.

What will I be able to do with the Data Lab in later releases?

In later releases, the Data Lab will expand the scope of its capabilities to allow users to:

  • Explore and analyze spectral and time-series datasets.
  • Implement flexible workflows using a variety of programmatic interfaces.
  • Utilize additional computing resources to support more intensive workflows.
  • Export workflows to larger scale computing facilities if necessary.
  • Utilize analysis tools based on user-contributed science code.

Is the Data Lab the new NOAO Science Archive? Is it a new computing center?

No. Although the Data Lab will provide some storage and compute resources, as well as access to other public datasets hosted at NOAO, the NOAO Science Archive (NSA) will continue to be the repository for (and principal source of) raw and pipelined image data. The Data Lab's proximity to the NSA and large catalogs will give it the fastest possible data access for workflows using its computing resources.

The Data Lab project sounds really ambitious. Will it really happen?

The Data Lab infrastructure and science capabilities can largely be built by adapting and extending existing code, with some code contributed by users during the operational phase. To control costs, the Data Lab will be implemented incrementally. The Data Lab rationale, design, and project execution plan will be reviewed by an external panel of scientists and technologists in early 2015 to confirm that NOAO is moving forward with an achievable project that will benefit the entire community. The Data Lab project will be reviewed annually by a standing committee with rotating membership.

How will I access the Data Lab?

The Data Lab will consist of a number of components (including published data services, virtual storage, and analysis and visualization tools). In the first release, these components will be accessible through a pre-defined mix of web interfaces, desktop applications, and command-line scripts. In later releases, interfaces to Data Lab components will be accessible from a range of common programming languages to allow highly flexible custom analysis.

What types of analysis will I be able to do with the Data Lab?

The Data Lab will initially provide a core set of tools for data manipulation, visualization, and catalog- and pixel-level analysis. The number of science tools will grow over time, and later releases will include tools developed by users, either using core Data Lab tools as a base or by extending other users’ applications with new capabilities. Scriptability and batch execution will allow increasingly complex workflows to be created if needed.

Will I be able to use my own code in the Data Lab?

Definitely. Starting with the first release, users will be able to configure custom development or runtime environments in their Data Lab user shell and run interactive applications as they would on their desktop, while taking advantage of faster data access and greater compute resources. These same custom environments can be packaged with tasks and deployed for batch processing of the data at other computing facilities or on the user’s desktop. In later releases, programmatic interfaces to the Data Lab systems will allow applications and scripts to work directly with the Data Lab both locally and remotely.

Can’t I do this already on my desktop machine?

Up to a point, yes. However, PI observing programs can easily generate several TB of imaging data or catalogs with tens of millions of objects, and survey programs produce significantly more. As a consequence, data transfer times or computationally expensive algorithms will in many cases reach the limit of what common desktop systems can handle efficiently. For certain types of analysis, the Data Lab will offer the option of faster data access and more computing power for more efficient processing of your data.
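
As a rough illustration of the data transfer point, the short Python sketch below estimates how long it takes to move a dataset over a network link. The 3 TB dataset size and the 100 Mb/s and 1 Gb/s link speeds are illustrative assumptions, not measured Data Lab figures, and the function name is made up for this example.

```python
def transfer_hours(size_tb: float, rate_mbps: float) -> float:
    """Hours needed to move size_tb terabytes at rate_mbps megabits per second.

    Uses decimal units (1 TB = 1e12 bytes, 1 Mb = 1e6 bits) and assumes the
    link runs at its full nominal rate, which real transfers rarely achieve.
    """
    bits = size_tb * 1e12 * 8           # dataset size in bits
    seconds = bits / (rate_mbps * 1e6)  # time at the nominal link rate
    return seconds / 3600.0


if __name__ == "__main__":
    # Hypothetical 3 TB program downloaded over two assumed link speeds.
    for rate in (100, 1000):
        print(f"{rate} Mb/s: {transfer_hours(3, rate):.1f} hours")
```

Under these assumptions the download alone takes from several hours to a few days, before any analysis begins.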

Can’t I do this (and more) at a large super-computing center?

Certainly, but not everyone has ready access to these centers, and jobs too big for your desktop won’t necessarily require that scale of computing resources. The Data Lab will offer:

  • Additional computing power for those users and science cases that need it
  • An environment for doing the open-ended exploration, experimentation, and collaboration that is needed in the early stages of projects involving large datasets
  • A platform for developing and refining detailed workflows at smaller scale before moving them to large computing facilities

Will I be able to share my derived data products and results with collaborators?

The Data Lab’s virtual storage will allow users to create groups of other users that can have shared access to their data, analysis scripts, visualizations, and results (i.e. everything from intermediate data products to final catalogs or image stacks). Members of a survey team will be able to use group access to a centralized data store to divide the work (e.g. nightly runs) between members or to exchange results. Secured access to databases may similarly be shared within a group.

Will I be able to make my results publicly available?

Once a final set of data products is produced, the Data Lab will provide a publishing mechanism to make the data (catalogs, images or spectra) available using standard Virtual Observatory (VO) protocols. These publishing tools can be run on the user’s own machine to share the data, or in some cases the Data Lab can host the data as a more permanent service.
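
To give a sense of what access through a standard VO protocol looks like for a catalog, the Python sketch below builds a query URL for the IVOA Simple Cone Search protocol, which takes a position (RA, DEC) and a search radius (SR) in decimal degrees. Whether a given published catalog is served this way is an assumption, and the service address shown is a placeholder, not a real Data Lab endpoint.

```python
from urllib.parse import urlencode


def cone_search_url(base_url: str, ra_deg: float, dec_deg: float,
                    radius_deg: float) -> str:
    """Build a Simple Cone Search query URL (RA, DEC, SR in decimal degrees)."""
    if not 0.0 <= ra_deg <= 360.0:
        raise ValueError("RA must be between 0 and 360 degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("DEC must be between -90 and +90 degrees")
    if radius_deg < 0.0:
        raise ValueError("search radius must not be negative")
    params = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    # Append to any query string the base URL already carries.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + params


if __name__ == "__main__":
    # Placeholder service address, used only for illustration.
    print(cone_search_url("https://example.org/scs", 10.5, -30.0, 0.1))
```

Any VO-aware desktop tool or script can issue the same kind of request, so a catalog published this way does not need custom client software.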

Will I be required to upload all of my local data to use it in the Data Lab?

Not always. Some components of the Data Lab will be exportable and can be installed on your local machine to reproduce the same functionality. For example, processing tools or virtual storage could be installed to work with local data, yet still allow workflows to remotely access large catalog data that may be needed in the analysis. Data Lab Virtual Machines will be available to download and run on your own hardware, or pre-packaged containers of specific components may be used on your machine (or in the cloud) when that is all that is needed.

Will I still be able to get my KPNO/CTIO data from the NSA and work entirely on my desktop?

Sure, and for some programs this may still be the best approach. Use of the Data Lab is entirely optional.

What common applications, frameworks, languages do you plan to use?

Because we are both adapting existing code and doing new development, the Data Lab itself will be implemented using a variety of languages and frameworks. Support for user-written software will cover a range of cases:

  • legacy code that transparently accesses data in virtual storage through a locally mounted filesystem,
  • desktop tools that already support access to VO data services,
  • scripts that use core Data Lab tools in a workflow, and
  • new science applications built on Data Lab programmatic interfaces.

Our initial focus will be to support high-level scripting through command-line utilities that can be called from many environments. Special support will be given to Python when building programmatic interfaces for the first release, but since much of the underlying code is based on C/C++, it will be possible to generate bindings for many languages and environments in the longer term.

Can I/my group get involved with Data Lab development?

To the greatest extent possible, all Data Lab software deployed and/or developed by NOAO will be open source. We are happy to discuss potential collaborations or how we might be able to help your project.

Where can I get more information?

Interested users should contact Mike Fitzpatrick (fitz@noao.edu), Knut Olsen (kolsen@noao.edu), or Ken Mighell (kmighell@noao.edu).

Last updated or reviewed December 23, 2014.