Why You Should be Using Jupyter Notebooks
Posted by Daniel Gutierrez, ODSC, June 23, 2020
This article provides a high-level overview of Project Jupyter and the widely popular Jupyter notebook technology. The overarching message I’d like to convey is why you should be using Jupyter for your data science projects. I’ve been using it for all my Python machine learning work, and I’m quite impressed and satisfied. It’s a great environment in which to develop code and communicate results.
Project Jupyter is a nonprofit organization created to “develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.” Spun off from IPython in 2014 by co-founder Fernando Pérez, Project Jupyter supports execution environments in several dozen languages.
The name “Jupyter” was chosen to bring to mind the ideas and traditions of science and the scientific method. Additionally, the core programming languages supported by Jupyter are Julia, Python, and R. While the name is not a strict acronym, it echoes these languages: Ju (Julia), Py (Python), and R.
Jupyter Notebooks
Jupyter Notebook is an open-source web application that allows data scientists to create and share documents that integrate live code, equations, computational output, visualizations and other multimedia resources, and explanatory text in one document. You can use Jupyter Notebooks for various data science tasks, including data cleaning and transformation, numerical simulation, exploratory data analysis, data visualization, statistical modeling, machine learning, deep learning, and more.
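To make the "one document" idea concrete, here is a minimal sketch of what a notebook looks like on disk. It assumes the version 4 notebook format, in which an .ipynb file is a JSON document holding a list of cells. The helper names (build_notebook, markdown_cell, code_cell) are hypothetical and exist only for this illustration; they are not part of any Jupyter library.

```python
import json


def markdown_cell(text):
    """Explanatory text cell (hypothetical helper for this sketch)."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Code cell that has not been executed yet (hypothetical helper)."""
    return {
        "cell_type": "code",
        "execution_count": None,  # filled in once a kernel runs the cell
        "metadata": {},
        "outputs": [],            # results are stored next to the code
        "source": source,
    }


def build_notebook(cells):
    """Wrap cells in the top-level structure of a version 4 notebook."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = build_notebook([
    markdown_cell("# Exploratory analysis\nA short note explaining the next step."),
    code_cell("values = [1, 2, 3, 4]\nsum(values) / len(values)"),
])

# The serialized text is what would be saved as a .ipynb file.
serialized = json.dumps(notebook, indent=1)
print(serialized)
```

Saving that string to a file with an .ipynb extension should give a notebook that Jupyter can open, with the text and the code sitting side by side in the same document.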
A Jupyter Notebook provides you with an easy-to-use, interactive data science environment that works not only as an integrated development environment (IDE), but also as a presentation or educational tool. Jupyter is a way of working with Python inside a virtual “notebook” and is growing in popularity with data scientists in large part due to its flexibility. It gives you a way to combine code, images, plots, comments, etc., in alignment with the steps of the “data science process.” Further, it is a form of interactive computing, an environment in which users execute code, see what happens, modify, and repeat in a kind of iterative conversation between the data scientist and the data. Data scientists can also use notebooks to create tutorials or interactive manuals for their software. Here is a short instructional video to help get you started with Jupyter.
A Jupyter notebook has two components. First, data scientists enter programming code or text in rectangular “cells” in a front-end web page. The browser then passes the code to a back-end “kernel,” which runs the code and returns the results. Many Jupyter kernels have been created, supporting dozens of programming languages. The kernels need not reside on the data scientist’s computer: notebooks can also run in the cloud, for example in Google’s Colaboratory project. You can even run Jupyter without network access, right on your own computer, and perform your work locally.
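The division of labor between cells and kernel can be sketched in a few lines of Python. This is a toy model only: the ToyKernel class and run_cells function are hypothetical names invented for this illustration, and a real Jupyter kernel runs as a separate process that exchanges messages with the front end rather than calling exec directly. What the sketch does show is the key behavior, namely that all code cells share one namespace, so a variable defined in one cell is visible in the next.

```python
import contextlib
import io


class ToyKernel:
    """Toy stand-in for a kernel: runs code cells in one shared namespace."""

    def __init__(self):
        self.namespace = {}

    def execute(self, source):
        """Run one cell's source and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return buffer.getvalue()


def run_cells(cells, kernel):
    """Play the front end: send code cells to the kernel, skip text cells."""
    results = []
    for cell in cells:
        if cell["cell_type"] == "code":
            results.append(kernel.execute(cell["source"]))
        else:
            results.append(None)  # text cells are rendered, not executed
    return results


cells = [
    {"cell_type": "markdown", "source": "Compute a mean"},
    {"cell_type": "code", "source": "values = [1, 2, 3, 4]"},
    {"cell_type": "code", "source": "print(sum(values) / len(values))"},
]

outputs = run_cells(cells, ToyKernel())
print(outputs)
```

The third cell can use values only because the second cell ran first in the same kernel, which is the "iterative conversation" style of work that notebooks encourage.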
Other Jupyter Tools
JupyterLab (originally launched in beta in January 2018) is commonly viewed as the next-generation user interface for Project Jupyter, offering all the familiar building blocks of the classic Jupyter Notebook (notebook, terminal, text editor, file browser, rich outputs, etc.) in a more flexible and powerful user interface.
The basic idea of JupyterLab is to bring all the building blocks of the classic notebook, plus some new features, under one roof. JupyterLab extends the familiar notebook metaphor with drag-and-drop functionality, as well as file browsers, data viewers, text editors, and a command console. Whereas the standard Jupyter notebook assigns each notebook its own kernel, JupyterLab creates a computing environment that allows these components to be shared. Thus, a data scientist could view a notebook in one window, edit a required data file in another, and log all executed commands in a third, all within a single web-browser interface.
Example of JupyterLab
Two additional tools have enriched Jupyter’s usability. One is JupyterHub, a service that allows institutions to provide Jupyter notebooks to large pools of users. The other is Binder, an open-source service that allows data scientists to use Jupyter notebooks on GitHub in a web browser without having to install the software or any programming libraries.
Platforms Using Jupyter
The popularity of Jupyter goes beyond its use as a stand-alone tool; it’s also integrated with a number of platforms familiar to data scientists.
Anaconda is a prepackaged distribution of Python which contains a number of Python modules and packages, including Jupyter. In fact, Anaconda is the recommended distribution when installing Jupyter. This is how I use Jupyter because I enjoy the flexibility afforded by using the Anaconda Navigator and the ability to define a number of different “environments” with different frameworks like TensorFlow, different Python versions, etc.
Kaggle Kernels are essentially Jupyter notebooks running in the browser. You can save yourself the hassle of setting up a local environment and work from anywhere in the world you have an internet connection.
Colab notebooks are Jupyter notebooks hosted by Google Colab. Colab enables users to collaborate and run code that uses Google’s cloud resources, such as GPUs and TPUs, and to save documents to Google Drive.
An Amazon SageMaker notebook instance is a fully managed machine learning EC2 compute instance that runs the Jupyter Notebook application. You use the notebook instance to create and manage Jupyter notebooks that you can use to prepare and process data and to train and deploy machine learning models.
Finally, there are many examples of Jupyter notebooks available on GitHub (reviewing them is a good way to learn what’s possible). There are more than 3 million public notebooks today, up from ~200,000 in 2015.
Conclusion
For data scientists, Jupyter has emerged in recent years as a de facto standard. The migration is arguably the fastest onto a platform in recent memory. A majority of the ML/DL research papers appearing on the arXiv.org pre-print server reference Jupyter notebooks that are well integrated into the research and use deep learning frameworks like TensorFlow and PyTorch. The beauty of Jupyter is that it creates a computational narrative, a document that allows researchers to supplement their code and data with analysis, hypothesis, and conjecture. For data scientists, that format can drive creative exploration. If you haven’t already looked at Jupyter technology, it is high time to do so!
About author
Daniel Gutierrez, ODSC
Daniel D. Gutierrez is a practicing data scientist who has been working with data since long before the field came into vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator, having taught data science, machine learning, and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.