Posted on

What is differential privacy in machine learning (preview)?

How differential privacy works

Differential privacy is a set of systems and practices that help keep the data of individuals safe and private. In machine learning solutions, differential privacy may be required for regulatory compliance.

Differential privacy machine learning process.

In traditional scenarios, raw data is stored in files and databases. When users analyze data, they typically use the raw data. This is a concern because it might infringe on an individual’s privacy. Differential privacy tries to deal with this problem by adding “noise” or randomness to the data so that users can’t identify any individual data points. At the least, such a system provides plausible deniability. Therefore, the privacy of individuals is preserved with limited impact on the accuracy of the data.

In differentially private systems, data is shared through requests called queries. When a user submits a query for data, operations known as privacy mechanisms add noise to the requested data. Privacy mechanisms return an approximation of the data instead of the raw data. This privacy-preserving result appears in a report. Reports consist of two parts: the actual data computed and a description of how the data was created.
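As a rough illustration of a privacy mechanism, the sketch below answers a counting query with Laplace noise and returns a small report holding both the noisy value and a description of how it was produced. It is a minimal sketch, not SmartNoise code: the function names, the report fields, and the sample `ages` data are all assumptions made for this example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent Exp(1) draws is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(values, predicate, epsilon: float, rng: random.Random) -> dict:
    """Answer a counting query with Laplace noise and return a small report."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1.
    sensitivity = 1.0
    scale = sensitivity / epsilon
    return {
        # Part 1 of the report: the (noisy) data computed.
        "value": true_count + laplace_noise(scale, rng),
        # Part 2 of the report: how the data was created.
        "mechanism": "laplace",
        "epsilon": epsilon,
        "scale": scale,
    }


# Hypothetical data: how many people are 40 or older?
ages = [23, 35, 41, 29, 52, 67, 38, 45]
report = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=random.Random(0))
print(report)
```

The analyst only ever sees `report["value"]`, never the exact count; a smaller `epsilon` gives a larger noise scale.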

Differential privacy metrics

Differential privacy tries to protect against the possibility that a user can produce an indefinite number of reports to eventually reveal sensitive data. A value known as epsilon measures how noisy, or private, a report is. Epsilon has an inverse relationship to noise or privacy. The lower the epsilon, the more noisy (and private) the data is.

Epsilon values are non-negative. Values below 1 provide full plausible deniability. Anything above 1 comes with a higher risk of exposure of the actual data. As you implement machine learning solutions with differential privacy, you want to work with data with epsilon values between 0 and 1.

Another value directly correlated to epsilon is delta. Delta is a measure of the probability that a report isn’t fully private. The higher the delta, the higher the epsilon. Because these values are correlated, epsilon is used more often.
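For reference, epsilon and delta come from the standard formal definition of differential privacy, which can be written as follows. The symbols are notation introduced here and do not appear in the text above: M is the privacy mechanism, D and D' are any two datasets that differ in one individual's record, and S is any set of possible reports.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,] + \delta
```

With a small epsilon the two probabilities are nearly equal, so a report says little about whether any one individual is in the data; delta is the small probability with which that guarantee is allowed to fail.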

Limit queries with a privacy budget

To ensure privacy in systems where multiple queries are allowed, differential privacy defines a rate limit. This limit is known as a privacy budget. Privacy budgets prevent data from being recreated through multiple queries. Privacy budgets are allocated an epsilon amount, typically between 1 and 3 to limit the risk of reidentification. As reports are generated, privacy budgets keep track of the epsilon value of individual reports as well as the aggregate for all reports. After a privacy budget is spent or depleted, users can no longer access data.
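The bookkeeping described above can be sketched in a few lines. This is an illustrative sketch only: the class and method names are hypothetical, and it assumes the simplest accounting rule, in which the epsilon values of individual reports add up.

```python
class BudgetExhausted(Exception):
    """Raised when a query would push total epsilon past the budget."""


class PrivacyBudget:
    """Tracks the epsilon of each report and the aggregate spent so far."""

    def __init__(self, total_epsilon: float):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.report_epsilons = []  # one entry per report generated

    @property
    def spent(self) -> float:
        return sum(self.report_epsilons)

    @property
    def remaining(self) -> float:
        return self.total_epsilon - self.spent

    def charge(self, epsilon: float) -> None:
        """Record one report, or refuse it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon:
            raise BudgetExhausted(
                f"need {epsilon}, only {self.remaining} remaining"
            )
        self.report_epsilons.append(epsilon)


budget = PrivacyBudget(total_epsilon=1.5)
for _ in range(3):
    budget.charge(0.5)      # three reports at epsilon = 0.5 each
print(budget.remaining)     # 0.0

try:
    budget.charge(0.5)      # a fourth report is refused
except BudgetExhausted as exc:
    print("refused:", exc)
```

A real system would call `charge` before running the privacy mechanism, so that a refused query never touches the data.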

Reliability of data

Although the preservation of privacy should be the goal, there’s a tradeoff when it comes to usability and reliability of the data. In data analytics, accuracy can be thought of as a measure of uncertainty introduced by sampling errors. This uncertainty tends to fall within certain bounds. Accuracy from a differential privacy perspective instead measures the reliability of the data, which is affected by the uncertainty introduced by the privacy mechanisms. In short, a higher level of noise or privacy translates to data that has a lower epsilon, accuracy, and reliability.
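To make the tradeoff concrete, the sketch below assumes the Laplace mechanism (one common privacy mechanism, not necessarily the one a given system uses) and a query with sensitivity 1. For Laplace noise with scale b, the probability that the noise exceeds t in absolute value is exp(-t/b), which gives a simple error bound at a chosen confidence level. The function names are illustrative.

```python
import math


def laplace_scale(epsilon: float, sensitivity: float = 1.0) -> float:
    """Noise scale b of the Laplace mechanism for a given epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def error_bound(epsilon: float, confidence: float = 0.95,
                sensitivity: float = 1.0) -> float:
    """Smallest t such that P(|noise| <= t) >= confidence."""
    # P(|noise| > t) = exp(-t / b), so t = b * ln(1 / (1 - confidence)).
    return laplace_scale(epsilon, sensitivity) * math.log(1.0 / (1.0 - confidence))


for eps in (0.1, 0.5, 1.0, 3.0):
    print(f"epsilon={eps:<4} scale={laplace_scale(eps):6.2f} "
          f"95% bound=+/-{error_bound(eps):6.2f}")
```

Under these assumptions, lowering epsilon from 1.0 to 0.1 widens the 95% error bound by a factor of ten.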

Open-source differential privacy libraries

SmartNoise is an open-source project that contains components for building machine learning solutions with differential privacy. SmartNoise is made up of the following top-level components:

  • SmartNoise Core library
  • SmartNoise SDK library

SmartNoise Core

The core library includes the following privacy mechanisms for implementing a differentially private system:

  • Analysis: A graph description of arbitrary computations.
  • Validator: A Rust library that contains a set of tools for checking and deriving the necessary conditions for an analysis to be differentially private.
  • Runtime: The medium to execute the analysis. The reference runtime is written in Rust, but runtimes can be written using any computation framework, such as SQL and Spark, depending on your data needs.
  • Bindings: Language bindings and helper libraries to build analyses. Currently SmartNoise provides Python bindings.

SmartNoise SDK

The system library provides the following tools and services for working with tabular and relational data:

Data Access: Library that intercepts and processes SQL queries and produces reports. This library is implemented in Python and supports the following ODBC and DBAPI data sources:

  • PostgreSQL
  • SQL Server
  • Spark
  • Presto
  • Pandas

Service: Execution service that provides a REST endpoint to serve requests or queries against shared data sources. The service is designed to allow composition of differential privacy modules that operate on requests containing different delta and epsilon values, also known as heterogeneous requests. This reference implementation accounts for additional impact from queries on correlated data.

Evaluator: Stochastic evaluator that checks for privacy violations, accuracy, and bias. The evaluator supports the following tests:

  • Privacy Test – Determines whether a report adheres to the conditions of differential privacy.
  • Accuracy Test – Measures whether the reliability of reports falls within the upper and lower bounds given a 95% confidence level.
  • Utility Test – Determines whether the confidence bounds of a report are close enough to the data while still maximizing privacy.
  • Bias Test – Measures the distribution of reports for repeated queries to ensure they aren’t unbalanced.

Posted on

Responsible AI – Privacy and Security Requirements

Training data and prediction requests can both contain sensitive information about people or businesses, which has to be protected. How do you safeguard the privacy of individuals? What steps are taken to ensure that individuals have control of their data? Several jurisdictions have regulations to ensure privacy and security.

In Europe you have the GDPR (General Data Protection Regulation) and in California there is the CCPA (California Consumer Privacy Act). Fundamentally, both give individuals control over their data and require that companies protect the data being used in the model. When data processing is based on consent, an individual has the right to revoke that consent at any time.

 Defending ML Models against attacks – Ensuring privacy of consumer data:

I briefly discussed tools for adversarial training, the CleverHans and FoolBox Python libraries, here: Model Debugging: Sensitivity Analysis, Adversarial Training, Residual Analysis. Let us now look at more stringent means of protecting an ML model against attacks, and thereby the privacy and security of its data. An ML model may be attacked in different ways; some literature classifies the attacks into “Information Harms” and “Behavioural Harms”. An Information Harm occurs when information is allowed to leak from the model. There are different forms of Information Harm: Membership Inference, Model Inversion and Model Extraction. In Membership Inference, the attacker can determine whether a given record is part of the training data. In Model Inversion, the attacker can reconstruct training data from the model, and in Model Extraction, the attacker is able to extract the entire model!

A Behavioural Harm occurs when the attacker can change the behaviour of the ML model itself, for example by inserting malicious data. I have given an example involving an autonomous vehicle in this article: Model Debugging: Sensitivity Analysis, Adversarial Training, Residual Analysis

Cryptography | Differential privacy to protect data

You should consider privacy-enhancing technologies like Secure Multi-Party Computation (SMPC) and Fully Homomorphic Encryption (FHE). SMPC involves multiple systems that jointly train or serve the model while the actual data is kept secure.
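One building block often used in SMPC is additive secret sharing, sketched below for a joint sum. This is a toy illustration under stated assumptions, not a production protocol: the modulus, the function names, and the three input values are all made up for the example, and a real deployment would also need secure channels and protection against misbehaving parties.

```python
import secrets

# Arithmetic is done modulo a large prime (an arbitrary choice for this sketch).
PRIME = 2**61 - 1


def share(secret: int, n_parties: int) -> list:
    """Split a secret into n random-looking shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares


def reconstruct(shares: list) -> int:
    """Recombine shares; any strict subset of them reveals nothing useful."""
    return sum(shares) % PRIME


# Three parties each hold one private value (hypothetical data).
private_inputs = [52000, 61000, 47000]

# Each party splits its input and sends one share to every party.
all_shares = [share(value, 3) for value in private_inputs]

# Party j adds up the shares it received (column j); no raw input is visible.
partial_sums = [sum(column) % PRIME for column in zip(*all_shares)]

# Publishing only the partial sums reveals the total and nothing else.
total = reconstruct(partial_sums)
print(total)  # 160000
```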

In FHE the data is encrypted. Prediction requests involve encrypted data, and training of the model is also carried out on encrypted data. Because the data is never decrypted except by the user, all computation runs on ciphertexts, which results in heavy computational cost. Users send encrypted prediction requests and receive back an encrypted result. The goal is to use cryptography to protect consumers’ data.

Differential Privacy in Machine Learning

Differential privacy protects data by adding noise to it so that attackers cannot identify the real content. SmartNoise is an open-source project that contains components for building machine learning solutions with differential privacy. SmartNoise is made up of the following top-level components:

✔️SmartNoise Core Library

✔️SmartNoise SDK Library

This is a good read to understand about Differential Privacy: https://docs.microsoft.com/en-us/azure/machine-learning/concept-differential-privacy

 Private Aggregation of Teacher Ensembles (PATE)

This follows the Knowledge Distillation concept that I discussed here: Post 1 – Knowledge Distillation, Post 2 – Knowledge Distillation. PATE begins by dividing the data into “k” partitions with no overlaps. It then trains k teacher models, one per partition, and combines their predictions into an aggregate teacher. During the aggregation, you add noise to the teachers’ combined output.

To train the student model, you take unlabelled public data and feed it to the aggregate teacher; the result is labelled data with which the student model is trained. For deployment, you use only the student model.
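The aggregation step can be sketched as a noisy vote. The example below is a simplified illustration rather than a full PATE implementation: the teachers are not trained here, their votes are hard-coded, and the function names and the `noise_scale` parameter are assumptions made for this sketch.

```python
import random
from collections import Counter


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample: difference of two Exp(1) draws, scaled."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_teacher_label(teacher_votes, num_classes, noise_scale, rng):
    """Add noise to each class's vote count and return the winning class."""
    counts = Counter(teacher_votes)
    noisy_counts = [
        counts[c] + laplace_noise(noise_scale, rng) for c in range(num_classes)
    ]
    return max(range(num_classes), key=lambda c: noisy_counts[c])


def label_public_data(votes_per_example, num_classes, noise_scale, seed=0):
    """Produce one noisy label per unlabelled public example."""
    rng = random.Random(seed)
    return [
        noisy_teacher_label(votes, num_classes, noise_scale, rng)
        for votes in votes_per_example
    ]


# Hypothetical votes from k = 50 teachers on two public examples.
votes_per_example = [
    [1] * 45 + [0] * 5,    # strong consensus on class 1
    [2] * 40 + [1] * 10,   # strong consensus on class 2
]
student_labels = label_public_data(votes_per_example, num_classes=3, noise_scale=1.0)
print(student_labels)
```

The student model would then be trained on the public examples paired with `student_labels`. When the teachers agree strongly, the noise rarely changes the winner, so a single individual's record has little influence on what the student sees.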

The process is illustrated in the following figure:

[Figure: PATE (Private Aggregation of Teacher Ensembles)]

Posted on

A one-up on motion capture

A new neural network approach captures the characteristics of a physical system’s dynamic motion from video, regardless of rendering configuration or image differences.
 
 

MIT researchers used the RISP method to predict the action sequence, joint stiffness, or movement of an articulated hand, like this one, from a target image or video.

From “Star Wars” to “Happy Feet,” many beloved films contain scenes that were made possible by motion capture technology, which records movement of objects or people through video. Further, applications for this tracking, which involve complicated interactions between physics, geometry, and perception, extend beyond Hollywood to the military, sports training, medical fields, and computer vision and robotics, allowing engineers to understand and simulate action happening within real-world environments.

As this can be a complex and costly process — often requiring markers placed on objects or people and recording the action sequence — researchers are working to shift the burden to neural networks, which could acquire this data from a simple video and reproduce it in a model. Work in physics simulations and rendering shows promise to make this more widely used, since it can characterize realistic, continuous, dynamic motion from images and transform back and forth between a 2D render and 3D scene in the world. However, to do so, current techniques require precise knowledge of the environmental conditions where the action is taking place, and the choice of renderer, both of which are often unavailable.

Now, a team of researchers from MIT and IBM has developed a trained neural network pipeline that avoids this issue, with the ability to infer the state of the environment and the actions happening, the physical characteristics of the object or person of interest (system), and its control parameters. When tested, the technique can outperform other methods in simulations of four physical systems of rigid and deformable bodies, which illustrate different types of dynamics and interactions, under various environmental conditions. Further, the methodology allows for imitation learning — predicting and reproducing the trajectory of a real-world, flying quadrotor from a video.

“The high-level research problem this paper deals with is how to reconstruct a digital twin from a video of a dynamic system,” says Tao Du PhD ’21, a postdoc in the Department of Electrical Engineering and Computer Science (EECS), a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and a member of the research team. In order to do this, Du says, “we need to ignore the rendering variances from the video clips and try to grasp the core information about the dynamic system or the dynamic motion.”

Du’s co-authors include lead author Pingchuan Ma, a graduate student in EECS and a member of CSAIL; Josh Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in the Department of Brain and Cognitive Sciences and a member of CSAIL; Wojciech Matusik, professor of electrical engineering and computer science and CSAIL member; and MIT-IBM Watson AI Lab principal research staff member Chuang Gan. This work was presented this week at the International Conference on Learning Representations.

While capturing videos of characters, robots, or dynamic systems to infer dynamic movement makes this information more accessible, it also brings a new challenge. “The images or videos [and how they are rendered] depend largely on the lighting conditions, on the background info, on the texture information, on the material information of your environment, and these are not necessarily measurable in a real-world scenario,” says Du. Without this rendering configuration information or knowledge of which renderer is used, it’s presently difficult to glean dynamic information and predict behavior of the subject of the video. Even if the renderer is known, current neural network approaches still require large sets of training data. However, with their new approach, this can become a moot point. “If you take a video of a leopard running in the morning and in the evening, of course, you’ll get visually different video clips because the lighting conditions are quite different. But what you really care about is the dynamic motion: the joint angles of the leopard — not if they look light or dark,” Du says.

In order to take rendering domains and image differences out of the issue, the team developed a pipeline system containing a neural network, dubbed “rendering invariant state-prediction (RISP)” network. RISP transforms differences in images (pixels) to differences in states of the system — i.e., the environment of action — making their method generalizable and agnostic to rendering configurations. RISP is trained using random rendering parameters and states, which are fed into a differentiable renderer, a type of renderer that measures the sensitivity of pixels with respect to rendering configurations, e.g., lighting or material colors. This generates a set of varied images and video from known ground-truth parameters, which will later allow RISP to reverse that process, predicting the environment state from the input video. The team additionally minimized RISP’s rendering gradients, so that its predictions were less sensitive to changes in rendering configurations, allowing it to learn to forget about visual appearances and focus on learning dynamical states. This is made possible by a differentiable renderer.
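The training objective described above can be written schematically as follows. This is a sketch based only on the description in this article, not a formula quoted from the paper, and every symbol is notation assumed here: s is a ground-truth state, ψ a randomly sampled rendering configuration, R_ψ the differentiable renderer, N_θ the state-prediction network, and β a weight on the rendering-gradient term.

```latex
\mathcal{L}(\theta) =
  \mathbb{E}_{s,\psi}\!\left[
    \big\| N_\theta\big(R_\psi(s)\big) - s \big\|^2
    + \beta \left\| \frac{\partial N_\theta\big(R_\psi(s)\big)}{\partial \psi} \right\|
  \right]
```

The first term asks the network to recover the state from the rendered image; the second penalizes how much the prediction changes when the rendering configuration changes, and it can only be computed because the renderer is differentiable.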

The method then uses two similar pipelines, run in parallel. One is for the source domain, with known variables. Here, system parameters and actions are entered into a differentiable simulation. The generated simulation’s states are combined with different rendering configurations into a differentiable renderer to generate images, which are fed into RISP. RISP then outputs predictions about the environmental states. At the same time, a similar target domain pipeline is run with unknown variables. RISP in this pipeline is fed these output images, generating a predicted state. When the predicted states from the source and target domains are compared, a new loss is produced; this difference is used to adjust and optimize some of the parameters in the source domain pipeline. This process can then be iterated on, further reducing the loss between the pipelines.

To determine the success of their method, the team tested it in four simulated systems: a quadrotor (a flying rigid body that doesn’t have any physical contact), a cube (a rigid body that interacts with its environment, like a die), an articulated hand, and a rod (deformable body that can move like a snake). The tasks included estimating the state of a system from an image, identifying the system parameters and action control signals from a video, and discovering the control signals from a target image that direct the system to the desired state. Additionally, they created baselines and an oracle, comparing the novel RISP process in these systems to similar methods that, for example, lack the rendering gradient loss, don’t train a neural network with any loss, or lack the RISP neural network altogether. The team also looked at how the gradient loss impacted the state prediction model’s performance over time. Finally, the researchers deployed their RISP system to infer the motion of a real-world quadrotor, which has complex dynamics, from video. They compared the performance to other techniques that lacked a loss function and used pixel differences, or one that included manual tuning of a renderer’s configuration.

In nearly all of the experiments, the RISP procedure outperformed similar or state-of-the-art methods, imitating or reproducing the desired parameters or motion, and proving to be a data-efficient and generalizable competitor to current motion capture approaches.

For this work, the researchers made two important assumptions: that information about the camera is known, such as its position and settings, as well as the geometry and physics governing the object or person that is being tracked. Future work is planned to address this.

“I think the biggest problem we’re solving here is to reconstruct the information in one domain to another, without very expensive equipment,” says Ma. Such an approach should be “useful for [applications such as the] metaverse, which aims to reconstruct the physical world in a virtual environment,” adds Gan. “It is basically an everyday, available solution, that’s neat and simple, to cross domain reconstruction or the inverse dynamics problem,” says Ma.

This research was supported, in part, by the MIT-IBM Watson AI Lab, Nexplore, DARPA Machine Common Sense program, Office of Naval Research (ONR), ONR MURI, and Mitsubishi Electric.


Posted on

DataRobot’s vision to democratize machine learning with no-code AI

 

The growing digitization of nearly every aspect of our world and lives has created immense opportunities for the productive application of machine learning and data science. Organizations and institutions across the board are feeling the need to innovate and reinvent themselves by using artificial intelligence and putting their data to good use. And according to several surveys, data science is among the fastest-growing in-demand skills in different sectors.

However, the growing demand for AI is hampered by the very low supply of data scientists and machine learning experts. Among the efforts to address this talent gap is the fast-evolving field of no-code AI, tools that make the creation and deployment of ML models accessible to organizations that don’t have enough highly skilled data scientists and machine learning engineers.

In an interview with TechTalks, Nenshad Bardoliwalla, chief product officer at DataRobot, discussed the challenges of meeting the needs of machine learning and data science in different sectors and how no-code platforms are helping democratize artificial intelligence.

Not enough data scientists

[…] the business value of machine learning, whether it’s predicting customer churn, ad clicks, the possibility of an engine breakdown, medical outcomes, or something else.

“We are seeing more and more companies who recognize that their competition is able to exploit AI and ML in interesting ways and they’re looking to keep up,” Bardoliwalla said.

At the same time, the growing demand for data science skills continues to widen the AI talent gap. And not everyone is served equally.

Underserved industries

The shortage of experts has created fierce competition for data science and machine learning talent. The financial sector is leading the way, aggressively hiring AI talent and putting machine learning models into use.

“If you look at financial services, you’ll clearly see that the number of machine learning models that are being put into production is by far the highest than any of the other segments,” Bardoliwalla said.

In parallel, big tech companies with deep pockets are also hiring top data scientists and machine learning engineers—or outright acquiring AI labs with all their engineers and scientists—to further fortify their data-driven commercial empires. Meanwhile, smaller companies and sectors that are not flush with cash have been largely left out of the opportunities provided by advances in artificial intelligence because they can’t hire enough data scientists and machine learning experts.

Bardoliwalla is especially passionate about what AI could do for the education sector.

“How much effort is being put into optimized student outcomes by using AI and ML? How much do the education industry and the school systems have in order to invest in that technology? I think the education industry as a whole is likely to be a lagger in the space,” he said.

Other areas that still have a ways to go before they can take advantage of advances in AI are transportation, utilities, and heavy machinery. And part of the solution might be to make ML tools that don’t require a degree in data science.

The no-code AI vision

“For every one of your expert data scientists, you have ten analytically savvy businesspeople who are able to frame the problem correctly and add the specific business-relevant calculations that make sense based on the domain knowledge of those people,” Bardoliwalla said.

As machine learning requires knowledge of programming languages such as Python and R and complicated libraries such as NumPy, Scikit-learn, and TensorFlow, most business people can’t create and test models without the help of expert data scientists. This is the area that no-code AI platforms are addressing.

DataRobot and other providers of no-code AI platforms are creating tools that enable these domain experts and business-savvy people to create and deploy machine learning models without the need to write code.

With DataRobot, users can upload their datasets on the platform, perform the necessary preprocessing steps, choose and extract features, and create and compare a range of different machine learning models, all through an easy-to-use graphical user interface.

“The whole notion of democratization is to allow companies and people in those companies who wouldn’t otherwise be able to take advantage of AI and ML to actually be able to do so,” Bardoliwalla said.

No-code AI is not a replacement for the expert data scientist. But it increases ML productivity across organizations, empowering more people to create models. This lifts much of the burden from the overloaded shoulders of data scientists and enables them to put their skills to more efficient use.

“The one person in that equation, the expert data scientist, is able to validate and govern and make sure that the models that are being generated by the analytically savvy businesspeople are quite accurate and make sense from an interpretability perspective—that they’re trustworthy,” Bardoliwalla said.

This evolution of machine learning tools is analogous to how the business intelligence industry has changed. A decade ago, the ability to query data and generate reports at organizations was limited to a few people who had the special coding skill set required to manage databases and data warehouses. But today, the tools have evolved to the point that non-coders and less technical people can perform most of their data querying tasks through easy-to-use graphical tools and without the assistance of expert data analysts. Bardoliwalla believes that the same transformation is happening in the AI industry thanks to no-code AI platforms.

“Whereas the business intelligence industry has historically focused on what has happened—and that is useful—AI and ML is going to give every person in the business the ability to predict what is going to happen,” Bardoliwalla said. “We believe that we can put AI and ML into the hands of millions of people in organizations because we have simplified the process to the point that many analytically savvy business people—and there are millions of such folks—working with the few million data scientists can deliver AI- and ML-specific outcomes.”

The evolution of no-code AI at DataRobot
