Ivy’s Data Science Weekly 31032021

Catch the data science weekly news and advancements here on the Ivy Blog, from AI forecasts and antimicrobial peptides to AI replacing therapists.

A new statistical method helps to ease the crisis of data reproducibility

Scientific research is facing a reproducibility crisis: many studies are difficult, and sometimes almost impossible, to replicate and validate. The problem is especially acute when a study involves a very large sample size. For instance, to evaluate the validity of a high-throughput genetic study’s findings, scientists must be able to replicate the study and achieve the same results.
Researchers at Penn State and the University of Minnesota have developed a statistical tool that can accurately estimate the replicability of a study. This removes the need to duplicate the work and helps mitigate the reproducibility crisis. According to Dajiang Liu, detecting patterns in genome-wide association studies requires data from a large number of individuals. Scientists often acquire these data by combining many existing, similarly designed studies. Liu and his colleagues did the same for a 2019 study of smoking and drinking addiction that ultimately comprised 1.2 million individuals. Liu also noted that the method can be applied to genome-wide association studies focused on a wide variety of traits.
It might seem obvious that the opinions of 2.3 million people are more representative than the opinions of a randomly selected 400. In reality, it depends entirely on how the bigger data set was put together. Hoping that high quantity can compensate for low quality is a classic mistake in the burgeoning field of big data, says Xiao-Li Meng, a professor of statistics at Harvard. In a perfectly random sample, there’s no correlation between someone’s opinion and their chance of being included in the data. If there’s even a 0.5% correlation (that is, a small amount of selection bias), the nonrandom sample of 2.3 million will be no better than the random sample of 400, Meng says. That’s a reduction in effective sample size of 99.98%.
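The 99.98% figure follows directly from the two sample sizes quoted above. A minimal arithmetic check, using only the numbers in this paragraph (the variable names are ours):

```python
# Figures quoted in the paragraph above.
biased_sample_size = 2_300_000   # large nonrandom sample
effective_sample_size = 400      # random sample it is said to be worth

# Share of the nominal sample size that is "lost" to selection bias.
reduction = 1 - effective_sample_size / biased_sample_size
reduction_percent = round(reduction * 100, 2)

print(f"Reduction in effective sample size: {reduction_percent}%")
```

In other words, only about 1 in every 5,750 responses in the big sample carries the weight of one truly random response.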

Meng compares data analysis to testing the saltiness of a large vat of soup. If the soup is well stirred, all you need is a tiny bit, less than a teaspoon, to tell how salty it is. In data terms, you’re taking a random sample of the soup vat. If the soup isn’t well stirred, you could drink gallons of it and still not know its average saltiness, because the part you didn’t taste might be different from the part you did taste.
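The soup analogy can be tried out with a small simulation. This is only a sketch under assumptions of our own: a made-up population of 200,000 people, a true opinion share of 50%, and a deliberately exaggerated selection bias (far stronger than the 0.5% correlation mentioned above) so that the effect is easy to see.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1 = holds the opinion, 0 = does not.
POPULATION_SIZE = 200_000
TRUE_SHARE = 0.50
population = [1 if random.random() < TRUE_SHARE else 0
              for _ in range(POPULATION_SIZE)]
true_mean = sum(population) / POPULATION_SIZE

def random_sample(pop, n):
    """Well-stirred soup: every person is equally likely to be picked."""
    return random.sample(pop, n)

def biased_sample(pop, p_if_one, p_if_zero):
    """Unstirred soup: the chance of inclusion depends on the opinion itself."""
    return [x for x in pop
            if random.random() < (p_if_one if x == 1 else p_if_zero)]

small_random = random_sample(population, 400)
big_biased = biased_sample(population, p_if_one=0.60, p_if_zero=0.40)

error_small_random = abs(sum(small_random) / len(small_random) - true_mean)
error_big_biased = abs(sum(big_biased) / len(big_biased) - true_mean)

print(f"random sample of {len(small_random)}: error {error_small_random:.3f}")
print(f"biased sample of {len(big_biased)}: error {error_big_biased:.3f}")
```

The biased sample is hundreds of times larger, yet its estimate lands further from the truth, because collecting more of the same unstirred soup does not remove the bias.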

Meng isn’t the first to stress the risk of selection bias. His contribution is in quantifying it. He has created what he calls a “data defect index” and has developed a formula that’s simple by the standards of mathematical statistics. Learn more in our future data science weekly news.
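The article does not spell the formula out. One approximation associated with Meng's work, as we understand it, is n_eff ≈ f / (1 - f) × 1 / ρ², where f is the fraction of the population that ended up in the sample and ρ is the correlation between someone's opinion and their chance of being included. The sketch below plugs in the numbers quoted earlier. The population size of 230 million is our assumption (chosen so that 2.3 million is 1% of it), and the function name is ours, so treat this as an illustration and not as Meng's exact derivation.

```python
def approx_effective_sample_size(sample_size, population_size, correlation):
    """Rough effective sample size of a biased sample.

    Assumed form: n_eff ~ f / (1 - f) * 1 / rho^2, with f = n / N.
    This is our reading of the formula, not something stated in the article.
    """
    f = sample_size / population_size
    return (f / (1 - f)) / correlation ** 2

n_eff = approx_effective_sample_size(
    sample_size=2_300_000,        # from the article
    population_size=230_000_000,  # ASSUMPTION: not given in the article
    correlation=0.005,            # the 0.5% correlation from the article
)

print(f"Approximate effective sample size: {n_eff:.0f}")
```

Under these assumptions the result comes out at roughly 400, which matches the figure Meng quotes.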

Surprisingly, AI has been getting quite a few things wrong all this while. This came to light when a team of researchers led by MIT discovered that a little more than 3% of the data in the most widely used machine learning data sets has been labeled incorrectly. The researchers looked at 10 major machine learning data sets and found that 3.4% of the data has been mislabeled. There are multiple types of errors. Amazon and IMDB reviews are labeled as positive when they may actually be negative. Image tags may identify the wrong subject in the image. There are video-based errors as well, such as a YouTube video being incorrectly labeled as a church bell.
The problem with incorrectly labeled data sets is that the AI then learns the incorrect identification. This makes it harder for AI-based systems to deliver correct results, and harder for us humans to trust them at all. AI is now an integral part of many things we interact with on a daily basis, such as web services, smartphones, smart speakers, and more. Researchers say that lower-capacity models may be practically more useful than higher-capacity models on real-world data sets with high proportions of erroneously labeled data.
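To get a feel for why a simpler model can cope better with bad labels, here is a toy sketch. It is not the researchers' experiment: we use a tiny hand-written nearest-neighbor classifier as a stand-in, treat the number of neighbors k as a rough "capacity" knob (k=1 memorizes every label, a larger k averages over many), and flip 30% of the training labels, which is far more than the 3.4% reported above. All names and numbers here are our own assumptions.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def true_label(x):
    """Clean rule for the toy task: class 1 when x is above 0.5."""
    return 1 if x > 0.5 else 0

def make_data(n, noise_rate):
    """Draw n points in [0, 1) and flip each label with probability noise_rate."""
    data = []
    for _ in range(n):
        x = random.random()
        y = true_label(x)
        if random.random() < noise_rate:
            y = 1 - y  # simulated labeling error
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, test_points, k):
    """Share of test points classified according to the clean rule."""
    correct = sum(knn_predict(train, x, k) == true_label(x) for x in test_points)
    return correct / len(test_points)

train = make_data(500, noise_rate=0.30)
test_points = [random.random() for _ in range(300)]

acc_high_capacity = accuracy(train, test_points, k=1)   # memorizes noisy labels
acc_low_capacity = accuracy(train, test_points, k=25)   # smooths over them

print(f"k=1:  {acc_high_capacity:.2f}")
print(f"k=25: {acc_low_capacity:.2f}")
```

The k=1 model repeats whatever label its single nearest neighbor carries, mistakes included, while the k=25 model outvotes most of the flipped labels.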
Websites, smartphone apps, and social media sites are dispensing mental-health advice, often using artificial intelligence. Meanwhile, clinicians and researchers are looking to AI to help define mental illness more objectively, identify high-risk people, and ensure the quality of care. Some experts believe AI can make treatment more accessible and affordable. There has long been a severe shortage of mental health professionals, and since the COVID-19 pandemic the need for support is greater than ever. For instance, users can have conversations with AI-powered chatbots, allowing them to get help anytime, anywhere, often for less money than traditional therapy.
The algorithms underpinning these endeavors learn by combing through large amounts of data. The data come from social media posts, smartphone data, electronic health records, therapy-session transcripts, brain scans, and other sources, and the algorithms use them to identify patterns that are difficult for humans to discern. Despite the promise, there are some big concerns. The efficacy of some products is questionable. This is made worse by the fact that private companies don’t always share information about how their AI works.
By combining machine learning, molecular dynamics simulations, and experiments, researchers have been able to design antimicrobial peptides from scratch. The approach, developed by researchers at IBM, is an important advance in a field where data is scarce and trial-and-error design is expensive and slow. Antimicrobial peptides, small molecules consisting of 12 to 50 amino acids, are promising drug candidates for tackling antibiotic resistance.
Artificial intelligence (AI) tools are helpful in discovering new drugs. The team first used a machine learning system called a deep generative autoencoder, which captures information about different peptide sequences. They then applied controlled latent attribute space sampling, a new computational method for generating peptide molecules with custom properties. This created a pool of 90,000 possible sequences. The team screened these molecules using deep learning classifiers for additional key attributes such as toxicity and broad-spectrum activity. The researchers then carried out peptide–membrane binding simulations on the pre-screened candidates. They finally selected 20 peptides, which were tested in lab experiments and in mice. Their studies indicated that the new peptides work by disrupting pathogen membranes.
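The overall shape of this workflow is a funnel: generate many candidates, screen them on several attributes, and keep a short list for the lab. The sketch below shows only that shape. Everything inside it is a placeholder of our own: random sequences stand in for the generative autoencoder, two simple residue counts stand in for the deep learning classifiers, and the simulation step is left out entirely. The scores have no biological validity.

```python
import random

random.seed(2)  # fixed seed so the sketch is repeatable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids

def generate_candidates(count, min_len=12, max_len=50):
    """Placeholder generator: random sequences of 12 to 50 residues."""
    return ["".join(random.choices(AMINO_ACIDS, k=random.randint(min_len, max_len)))
            for _ in range(count)]

def toy_activity_score(seq):
    """Placeholder 'activity' score: fraction of positively charged residues (K, R)."""
    return (seq.count("K") + seq.count("R")) / len(seq)

def toy_toxicity_score(seq):
    """Placeholder 'toxicity' score: fraction of cysteine (C). An arbitrary choice."""
    return seq.count("C") / len(seq)

def screen(candidates, keep, max_toxicity=0.05):
    """Drop candidates above the toxicity cutoff, then keep the top scorers."""
    passed = [s for s in candidates if toy_toxicity_score(s) < max_toxicity]
    return sorted(passed, key=toy_activity_score, reverse=True)[:keep]

pool = generate_candidates(90_000)   # pool size taken from the article
shortlist = screen(pool, keep=20)    # final count taken from the article

print(f"{len(pool)} candidates narrowed down to {len(shortlist)}")
```

The point of the funnel is cost: cheap computational filters run on tens of thousands of sequences so that the expensive steps, simulations and lab experiments, only have to handle a few.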

Learn and grow with Ivy

Ivy Professional School will continue to bring the latest to its learners through a weekly data science news blog. The growth in data science is huge, and 2021 is becoming the turning point of the 21st century. However, current demand outstrips the talent available in data science. Join our prestigious diploma certificate program backed by NASSCOM, Government of India, to propel your career. Having trouble deciding? Take a look at all our courses and reach out to us here. We always strive to provide the best learning experience to our learners. Listen to our students. Always stay tuned and updated with our data science weekly.
