
Responsible AI – Privacy and Security Requirements

Training data and prediction requests can both contain sensitive information about people or businesses that must be protected. How do you safeguard individuals' privacy? What steps are taken to ensure that individuals retain control of their data? Many countries have regulations in place to enforce privacy and security.

In Europe there is the GDPR (General Data Protection Regulation) and in California there is the CCPA (California Consumer Privacy Act). Fundamentally, both give individuals control over their data and require that companies protect the data used in a model. When data processing is based on consent, an individual has the right to revoke that consent at any time.

Defending ML Models against attacks – Ensuring privacy of consumer data

I have briefly discussed the tools for adversarial training – the CleverHans and Foolbox Python libraries – here: Model Debugging: Sensitivity Analysis, Adversarial Training, Residual Analysis. Let us now look at more stringent means of protecting an ML model against attacks, thereby ensuring the privacy and security of data. An ML model may be attacked in different ways; some literature classifies the attacks into "Information Harms" and "Behavioural Harms". Information Harms occur when information is allowed to leak from the model, and they take different forms: Membership Inference, Model Inversion and Model Extraction. In Membership Inference, the attacker can determine whether a given record was part of the training data. In Model Inversion, the attacker can reconstruct training data from the model, and in Model Extraction, the attacker is able to extract the entire model itself!
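To make the Membership Inference idea concrete, here is a minimal sketch (the model, the synthetic data and the confidence threshold are my illustrative assumptions, not taken from any particular attack paper): an overfit classifier tends to be far more confident on records it was trained on, and an attacker can exploit that gap.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 2, size=200)
X_unseen = rng.normal(size=(200, 5))  # records that were NOT used for training

# An overfit model tends to memorise its training set.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

def looks_like_member(x, threshold=0.9):
    # Flag a record as a training-set member when the model is unusually confident.
    return model.predict_proba(x.reshape(1, -1)).max() > threshold

member_rate = np.mean([looks_like_member(x) for x in X_train])
unseen_rate = np.mean([looks_like_member(x) for x in X_unseen])
print(f"flagged as members: {member_rate:.0%} of training rows "
      f"vs {unseen_rate:.0%} of unseen rows")
```

The large gap between the two rates is exactly the signal a membership inference attacker looks for; the defences discussed below (differential privacy, PATE) are designed to shrink it.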

Behavioural Harm occurs when the attacker can change the behaviour of the ML model itself, for example by inserting malicious data. I have given an example of this for an autonomous vehicle in this article: Model Debugging: Sensitivity Analysis, Adversarial Training, Residual Analysis.

Cryptography | Differential privacy to protect data

You should consider privacy-enhancing technologies like Secure Multi-Party Computation (SMPC) and Fully Homomorphic Encryption (FHE). SMPC involves multiple systems jointly training or serving the model whilst the actual data is kept secret: each party holds only a meaningless share of the data, as illustrated in the sketch below.
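A toy illustration of the additive secret-sharing idea underlying SMPC (a minimal sketch in plain Python; the three-party setup and the modulus are illustrative assumptions, not a production protocol):

```python
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties=3):
    # Split a secret into random shares; any subset short of all of them is useless.
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Two private inputs are shared among three parties ...
a_shares, b_shares = share(100), share(250)
# ... each party locally adds its shares of a and b ...
sum_shares = [(a + b) % PRIME for a, b in zip(a_shares, b_shares)]
# ... and only the final result is ever revealed: 350.
print(reconstruct(sum_shares))
```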

In FHE, the data remains encrypted throughout: prediction requests are made on encrypted data, and training of the model is also carried out on encrypted data. This comes at a heavy computational cost, because the data is never decrypted except by the user. Users send encrypted prediction requests and receive back an encrypted result. The goal is that, using cryptography, you can protect the consumer's data end to end.
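To make "computing on encrypted data" concrete, here is a toy example of the Paillier cryptosystem, which is additively homomorphic – a much simpler cousin of FHE that supports only addition on ciphertexts. The primes are deliberately tiny and insecure; this is a didactic sketch, not the scheme any FHE product actually uses:

```python
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

# Key generation with tiny demo primes (hopelessly insecure, for illustration only).
p, q = 293, 433
n = p * q
n_sq = n * n
g = n + 1
lam = lcm(p - 1, q - 1)
# With L(x) = (x - 1) // n, the private value mu = L(g^lam mod n^2)^-1 mod n.
mu = pow((pow(g, lam, n_sq) - 1) // n, -1, n)

def encrypt(m, r):
    # Ciphertext c = g^m * r^n mod n^2, where r must be coprime to n.
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    return ((pow(c, lam, n_sq) - 1) // n * mu) % n

c1, c2 = encrypt(20, r=17), encrypt(22, r=31)
# The homomorphic property: multiplying ciphertexts adds the plaintexts.
print(decrypt((c1 * c2) % n_sq))  # -> 42, and the addition happened on encrypted values
```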

Differential Privacy in Machine Learning

Differential privacy protects data by adding noise so that attackers cannot identify the real content. SmartNoise is an open-source project that contains components for building machine learning solutions with differential privacy. SmartNoise comprises the following top-level components:

✔️ SmartNoise Core Library

✔️ SmartNoise SDK Library

This is a good read to understand differential privacy: https://docs.microsoft.com/en-us/azure/machine-learning/concept-differential-privacy
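At the heart of most differential privacy implementations is the Laplace mechanism: noise with scale sensitivity/ε is added to each query answer, so that any single individual's presence or absence is masked. A minimal sketch of the general mechanism (the dataset and ε are illustrative; this is not the SmartNoise API):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Return a differentially private answer to a numeric query."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

ages = np.array([34, 45, 29, 61, 52, 38])
# A counting query has sensitivity 1: adding or removing one person
# changes the count by at most 1.
private_count = laplace_mechanism(len(ages), sensitivity=1, epsilon=0.5)
print(f"true count = {len(ages)}, private count = {private_count:.2f}")
```

Smaller values of ε mean more noise and stronger privacy; the art is choosing ε so the statistics stay useful.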

 Private Aggregation of Teacher Ensembles (PATE)

This follows the knowledge distillation concept that I discussed here: Post 1 – Knowledge Distillation, Post 2 – Knowledge Distillation. PATE begins by dividing the data into k non-overlapping partitions. It then trains one teacher model per partition and aggregates their predictions into an aggregate teacher. During this aggregation, noise is added to the teachers' combined votes, and it is this noise that provides the privacy guarantee.

Only the student model is deployed. To train it, you take unlabelled public data and feed it to the aggregate teacher; the result is labelled data, with which the student model is trained.
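Here is a minimal sketch of PATE's noisy aggregation step, assuming the k teacher models have already been trained on disjoint partitions (the vote counts, number of classes and ε are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)

def noisy_aggregate(teacher_votes, num_classes, epsilon):
    """Return a differentially private label from the teachers' votes."""
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=1.0 / epsilon, size=num_classes)  # noisy max
    return int(np.argmax(counts))

# Ten teachers (one per data partition) vote on one unlabelled public
# example with three possible classes.
votes = np.array([2, 2, 1, 2, 2, 0, 2, 1, 2, 2])
print("label given to the student:", noisy_aggregate(votes, num_classes=3, epsilon=1.0))
```

Because the student only ever sees these noisy labels on public data, it cannot leak the private training records held by the teachers.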

The process is illustrated in the figure below:

[Figure: PATE (Private Aggregation of Teacher Ensembles)]


An extreme form of encryption could solve big data’s privacy problem

Fully homomorphic encryption allows us to run analysis on data without ever seeing the contents. It could help us reap the full benefits of big data, from fighting financial fraud to catching diseases early

Like any doctor, Jacques Fellay wants to give his patients the best care possible. But his instrument of choice is no scalpel or stethoscope; it is far more powerful than that. Hidden inside each of us are genetic markers that can tell doctors like Fellay which individuals are susceptible to diseases such as AIDS, hepatitis and more. If he can learn to read these clues, then Fellay would have advance warning of who requires early treatment.

This could be life-saving. The trouble is, teasing out the relationships between genetic markers and diseases requires an awful lot of data, more than any one hospital has on its own. You might think hospitals could pool their information, but it isn’t so simple. Genetic data contains all sorts of sensitive details about people that could lead to embarrassment, discrimination or worse. Ethical worries of this sort are a serious roadblock for Fellay, who is based at Lausanne University Hospital in Switzerland. “We have the technology, we have the ideas,” he says. “But putting together a large enough data set is more often than not the limiting factor.”

Fellay’s concerns are a microcosm of one of the world’s biggest technological problems. The inability to safely share data hampers progress in all kinds of other spheres too, from detecting financial crime to responding to disasters and governing nations effectively. Now, a new kind of encryption is making it possible to wring the juice out of data without anyone ever actually seeing it. This could help end big data’s big privacy problem – and Fellay’s patients could be some of the first to benefit.

It was more than 15 years ago that we first heard that “data is the new oil”, a phrase coined by the British mathematician and marketing expert Clive Humby. Today, we are used to the idea that personal data is valuable. Companies like Meta, which owns Facebook, and Google’s owner Alphabet grew into multibillion-dollar behemoths by collecting information about us and using it to sell targeted advertising.

Data could do good for all of us too. Fellay’s work is one example of how medical data might be used to make us healthier. Plus, Meta shares anonymised user data with aid organisations to help plan responses to floods and wildfires, in a project called Disaster Maps. And in the US, around 1400 colleges analyse academic records to spot students who are likely to drop out and provide them with extra support. These are just a few examples out of many – data is a currency that helps make the modern world go around.

Getting such insights often means publishing or sharing the data. That way, more people can look at it and conduct analyses, potentially drawing out unforeseen conclusions. Those who collect the data often don’t have the skills or advanced AI tools to make the best use of it, either, so it pays to share it with firms or organisations that do. Even if no outside analysis is happening, the data has to be kept somewhere, which often means on a cloud storage server, owned by an external company.

You can’t share raw data unthinkingly. It will typically contain sensitive personal details, anything from names and addresses to voting records and medical information. There is an obligation to keep this information private, not just because it is the right thing to do, but because of stringent privacy laws, such as the European Union’s General Data Protection Regulation (GDPR). Breaches can see big fines.

Over the past few decades, we have come up with ways of trying to preserve people’s privacy while sharing data. The traditional approach is to remove information that could identify someone or make these details less precise, says privacy expert Yves-Alexandre de Montjoye at Imperial College London. You might replace dates of birth with an age bracket, for example. But that is no longer enough. “It was OK in the 90s, but it doesn’t really work any more,” says de Montjoye. There is an enormous amount of information available about people online, so even seemingly insignificant nuggets can be cross-referenced with public information to identify individuals.

One significant case of reidentification from 2021 involves apparently anonymised data sold to a data broker by the dating app Grindr, which is used by gay people among others. A media outlet called The Pillar obtained it and correlated the location pings of a particular mobile phone represented in the data with the known movements of a high-ranking US priest, showing that the phone popped up regularly near his home and at the locations of multiple meetings he had attended. The implication was that this priest had used Grindr, and a scandal ensued because Catholic priests are required to abstain from sexual relationships and the church considers homosexual activity a sin.

A more sophisticated way of maintaining people’s privacy has emerged recently, called differential privacy. In this approach, the manager of a database never shares the whole thing. Instead, they allow people to ask questions about the statistical properties of the data – for example, “what proportion of people have cancer?” – and provide answers. Yet if enough clever questions are asked, this can still lead to private details being triangulated. So the database manager also uses statistical techniques to inject errors into the answers, for example recording the wrong cancer status for some people when totting up totals. Done carefully, this doesn’t affect the statistical validity of the data, but it does make it much harder to identify individuals. The US Census Bureau adopted this method when the time came to release statistics based on its 2020 census.

Trust no one

Still, differential privacy has its limits. It only provides statistical patterns and can’t flag up specific records – for instance to highlight someone at risk of disease, as Fellay would like to do. And while the idea is “beautiful”, says de Montjoye, getting it to work in practice is hard.

There is a completely different and more extreme solution, however, one with origins going back 40 years. What if you could encrypt and share data in such a way that others could analyse it and perform calculations on it, but never actually see it? It would be a bit like placing a precious gemstone in a glovebox, one of those sealed chambers used in labs for handling hazardous material. You could invite people to put their arms into the gloves and handle the gem. But they wouldn't have free access and could never steal anything.

This was the thought that occurred to Ronald Rivest, Len Adleman and Michael Dertouzos at the Massachusetts Institute of Technology in 1978. They devised a theoretical way of making the equivalent of a secure glovebox to protect data. It rested on a mathematical idea called a homomorphism, which refers to the ability to map data from one form to another without changing its underlying structure. Much of this hinges on using algebra to represent the same numbers in different ways.

Imagine you want to share a database with an AI analytics company, but it contains private information. The AI firm won’t give you the algorithm it uses to analyse data because it is commercially sensitive. So, to get around this, you homomorphically encrypt the data and send it to the company. It has no key to decrypt the data. But the firm can analyse the data and get a result, which itself is encrypted. Although the firm has no idea what it means, it can send it back to you. Crucially, you can now simply decrypt the result and it will make total sense.
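That client-server loop can be sketched in a few lines using the open-source TenSEAL library (an illustrative example, not the tooling used by the companies in this article; the CKKS parameters and the trivial 2x + 1 "model" are assumptions for demonstration):

```python
import tenseal as ts

# Client side: create a CKKS context (which holds the secret key) and
# encrypt the sensitive inputs.
ctx = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40
enc_x = ts.ckks_vector(ctx, [1.5, 2.0, 3.25])

# "Server" side: evaluate a (commercially sensitive) linear model 2x + 1
# directly on the ciphertext, with no ability to read the underlying values.
enc_result = enc_x * 2 + 1

# Client side: only the key holder can decrypt; CKKS results are approximate.
print(enc_result.decrypt())  # roughly [4.0, 5.0, 7.5]
```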

“The promise is massive,” says Tom Rondeau at the US Defense Advanced Research Projects Agency (DARPA), which is one of many organisations investigating the technology. “It’s almost hard to put a bound to what we can do if we have this kind of technology.”

In the 30 years since the method was proposed, researchers devised homomorphic encryption schemes that allowed them to carry out a restricted set of operations, for instance only additions or multiplications. Yet fully homomorphic encryption, or FHE, which would let you run any program on the encrypted data, remained elusive. “FHE was what we thought of as being the holy grail in those days,” says Marten van Dijk at CWI, the national research institute for mathematics and computer science in the Netherlands. “It was kind of unimaginable.”

One approach to homomorphic encryption at the time involved an idea called lattice cryptography. This encrypts ordinary numbers by mapping them onto a grid with many more dimensions than the standard two. It worked – but only up to a point. Each computation ended up adding randomness to the data. As a result, doing anything more than a simple computation led to so much randomness building up that the answer became unreadable.

In 2009, Craig Gentry, then a PhD student at Stanford University in California, made a breakthrough. His brilliant solution was to periodically remove this randomness by decrypting the data under a secondary covering of encryption. If that sounds paradoxical, imagine that glovebox with the gem inside. Gentry's scheme was like putting one glovebox inside another, so that the first one could be opened while still encased in a layer of security. This provided a workable FHE scheme for the first time.

Workable, but still slow: computations on the FHE-encrypted data could take millions of times longer than identical ones on raw data. Gentry went on to work at IBM, and over the next decade, he and others toiled to make the process quicker by improving the underlying mathematics. But lately the focus has shifted, says Michael Osborne at IBM Research in Zurich, Switzerland. There is a growing realisation that massive speed enhancements can be achieved by optimising the way cryptography is applied for specific uses. "We're getting orders of magnitude improvements," says Osborne.

IBM now has a suite of FHE tools that can run AI and other analyses on encrypted data. Its researchers have shown they can detect fraudulent transactions in encrypted credit card data using an artificial neural network that can crunch 4000 records per second. They also demonstrated that they could use the same kind of analysis to scour the encrypted CT scans of more than 1500 people’s lungs to detect signs of covid-19 infection.

Also in the works are real-world, proof-of-concept projects with a variety of customers. In 2020, IBM revealed the results of a pilot study conducted with the Brazilian bank Banco Bradesco. Privacy concerns and regulations often prevent banks from sharing sensitive data either internally or externally. But in the study, IBM showed it could use machine learning to analyse encrypted financial transactions from the bank’s customers to predict if they were likely to take out a loan. The system was able to make predictions for more than 16,500 customers in 10 seconds and it performed just as accurately as the same analysis performed on unencrypted data.

Suspicious activity

Other companies are keen on this extreme form of encryption too. Computer scientist Shafi Goldwasser, a co-founder of privacy technology start-up Duality, says the firm is achieving significantly faster speeds by helping customers better structure their data and tailoring tools to their problems. Duality’s encryption tech has already been integrated into the software systems that technology giant Oracle uses to detect financial crimes, where it is assisting banks in sharing data to detect suspicious activity.

Still, for most applications, FHE processing remains at least 100,000 times slower than processing unencrypted data, says Rondeau. This is why, in 2020, DARPA launched a programme called Data Protection in Virtual Environments to create specialised chips designed to run FHE. Lattice-encrypted data comes in much larger chunks than normal chips are used to dealing with. So several research teams involved in the project, including one led by Duality, are investigating ways to alter circuits to efficiently process, store and move this kind of data. The goal is to analyse any FHE-encrypted data just 10 times slower than usual, says Rondeau, who is managing the programme.

Even if it were lightning fast, FHE wouldn’t be flawless. Van Dijk says it doesn’t work well with certain kinds of program, such as those that contain branching logic made up of “if this, do that” operations. Meanwhile, information security researcher Martin Albrecht at Royal Holloway, University of London, points out that the justification for FHE is based on the need to share data so it can be analysed. But a lot of routine data analysis isn’t that complicated – doing it yourself might sometimes be simpler than getting to grips with FHE.

For his part, de Montjoye is a proponent of privacy engineering: not relying on one technology to protect people’s data, but combining several approaches in a defensive package. FHE is a great addition to that toolbox, he reckons, but not a standalone winner.

This is exactly the approach that Fellay and his colleagues have taken to smooth the sharing of medical data. Fellay worked with computer scientists at the Swiss Federal Institute of Technology in Lausanne who created a scheme combining FHE with another privacy-preserving tactic called secure multiparty computation (SMC). This sees the different organisations join up chunks of their data in such a way that none of the private details from any organisation can be retrieved.

In a paper published in October 2021, the team used a combination of FHE and SMC to securely pool data from multiple sources and use it to predict the efficacy of cancer treatments or identify specific variations in people’s genomes that predict the progression of HIV infection. The trial was so successful that the team has now deployed the technology to allow Switzerland’s five university hospitals to share patient data, both for medical research and to help doctors personalise treatments. “We’re implementing it in real life,” says Fellay, “making the data of the Swiss hospitals shareable to answer any research question as long as the data exists.”

If data is the new oil, then it seems the world’s thirst for it isn’t letting up. FHE could be akin to a new mining technology, one that will open up some of the most valuable but currently inaccessible deposits. Its slow speed may be a stumbling block. But, as Goldwasser says, comparing the technology with completely unencrypted processing makes no sense. “If you believe that security is not a plus, but it’s a must,” she says, “then in some sense there is no overhead.”

New Scientist, 6 April 2022, by Edd Gent


GDPR Three Years on the Road: The 10 Key Developments You Should Know

July 29, 2021

On the third anniversary of the General Data Protection Regulation, Cooley started a series of webinars focused on the GDPR.

Our first webinar covers what we consider "the Top 10 key developments you should know" concerning the implementation of this ground-breaking personal data privacy regime. The webinar is available at: https://videopress.com/embed/jeN8KgiT?preloadContent=metadata&hd=1

#1: GDPR: It’s here to stay, and it’s never going to go away!

There’s been some debate around the need to reform the GDPR. However, it is unlikely that this reform is going to happen in the short term if we take into consideration that the European Commission noted in its 2020 evaluation report of the GDPR that it considers the GDPR has met its objectives. For the European Commission, the GDPR has given stronger rights to individuals while businesses are developing a compliance culture and using data protection as a competitive advantage, among others. 

#2: Playtime seems to be over (both for companies and DPAs)

Looking at the past three years of enforcement by the national data protection authorities, we have seen some kind of evolution in the enforcement area:

  • From June to the end of 2018: National authorities were setting up and reorganizing their teams to align their internal structures and resources with their new roles under the GDPR. This resulted in very few enforcement actions
  • 2019: Enforcement increased, but it consisted mainly of small fines targeting small companies
  • 2020: National data protection authorities started imposing very high monetary penalties, but many of these were appealed
  • 2021: This year, we have started to see more mature and sophisticated enforcement decisions

#3: GDPR: the global ripple effect

GDPR has been a great inspiration around the globe. Some countries have started to implement new data protection frameworks that are aligned with the GDPR, such as the United States with the California Consumer Privacy Act and Brazil with the General Law for the Protection of Personal Data (LGPD). India is following closely, and a law is expected to be finalized at the end of this year.

#4: Data transfers have become a key challenge

Data transfers have become a key challenge for global organizations. Following the European Court of Justice Schrems II case, companies need to complete a Data Transfer Impact Assessment before transferring any data outside of the EEA, assessing the law and practice of the country of the data importer.

Although the European Court of Justice didn’t invalidate the SCCs, companies now also have to supplement them with additional contractual and technical measures following the European Data Protection Board guidance.

#5: Brexit has added an additional level of complexity

Following Brexit, we now have two GDPRs – a UK one and an EU one. Although currently both frameworks are basically identical, we may expect that there will be some deviations in the future. Brexit has also brought some duplications in relation to appointments of DPOs, representatives and BCRs.

#6: EU countries make use of the possibility to fine-tune via national laws

The GDPR has brought a fair amount of harmonization to the EU data protection framework. However, EU member states can still fine-tune the GDPR locally by imposing additional requirements in areas such as the appointment of DPOs, the processing activities that require a Data Protection Impact Assessment, or the age under which parental consent is needed to provide online services to children.

#7: To consent or not to consent, that’s the question

The GDPR raises the bar for consent: pre-ticked boxes are not valid, and companies must be able to demonstrate that individuals gave consent freely. Consent can also be withdrawn at any time. All of this makes consent a difficult legal basis to rely on.

#8: Regulator guidance: creating clarity or more confusion? (Thankfully it’s black and white…. No grey areas to cause confusion)

The EDPB and the national data protection authorities have issued a lot of guidance since 2018 on multiple matters such as virtual voice assistants, data breach notifications, international data transfers and the concepts of controller and processor. In most cases the guidance is more restrictive than the GDPR.

The European Court of Justice has also had an active role in defining the GDPR through cases such as Fashion ID, Orange and Schrems II.

#9: Much more sophisticated and balanced data processing/sharing agreements

The relationship between data processors and controllers has become more mature and sophisticated. Every stage of the relationship – from onboarding, through contract execution, to the end of the contractual term – has been affected by the GDPR.

#10: And more is yet to come: what about 2022?

The EU Commission is quite active on data protection. There’s new legislation on the horizon mirroring GDPR, such as the Artificial Intelligence Regulation. Another area where we expect changes is e-privacy.

From the United States' perspective there is a lot of activity, and, as mentioned earlier, the GDPR has inspired much of it. Apart from the CCPA, Alabama enacted a data breach notification law in 2018, and other states such as Washington, Virginia and New York have begun to introduce baseline privacy legislation.

Cooley’s cyber/data/privacy group
  • 50+ lawyers globally counseling on privacy, cybersecurity and data protection matters
  • Holistic approach to compliance and security, built to preserve and protect enterprise value
  • Market-leading privacy and data breach litigation
Contributors

Patrick Van Eecke

Guadalupe Sampedro

Travis LeBlanc

Amal Ali