
Bioinformatics journey

I shall soon be sharing with you all my journey as a self-taught bioinformatician.

Breast Cancer Classification Project in Python

Get familiar with the terms used in the Breast Cancer Classification project in Python.

What is Deep Learning?

An intensive approach to Machine Learning, Deep Learning is inspired by the workings of the human brain and its biological neural networks. Architectures such as deep neural networks, recurrent neural networks, convolutional neural networks, and deep belief networks are made of multiple layers that the data passes through before finally producing the output. Deep Learning serves to improve AI and make many of its applications possible; it is applied to fields such as computer vision, speech recognition, natural language processing, audio recognition, and drug design.

What is Keras?

Keras is an open-source neural-network library written in Python. It is a high-level API and can run on top of TensorFlow, CNTK, and Theano. Keras is all about enabling fast experimentation and prototyping while running seamlessly on CPU and GPU. It is user-friendly, modular, and extensible.

Breast Cancer Classification – Objective

To build a breast cancer classifier on an IDC dataset that can accurately classify a histology image as benign or malignant.

Breast Cancer Classification – About the Python Project

In this Python project, we'll build a classifier to train on 80% of a breast cancer histology image dataset. Of this, we'll keep 10% of the data for validation. Using Keras, we'll define a CNN (Convolutional Neural Network), call it CancerNet, and train it on our images. We'll then derive a confusion matrix to analyze the performance of the model.

IDC is Invasive Ductal Carcinoma: cancer that develops in a milk duct and invades the fibrous or fatty breast tissue outside the duct. It is the most common form of breast cancer, accounting for 80% of all breast cancer diagnoses. Histology is the study of the microscopic structure of tissues.

The Dataset

We'll use the IDC_regular dataset (the breast cancer histology image dataset) from Kaggle. This dataset holds 277,524 patches of size 50×50 extracted from 162 whole mount slide images of breast cancer specimens scanned at 40x. Of these, 198,738 test negative and 78,786 test positive for IDC. The dataset is available in the public domain and you can download it here. You'll need a minimum of 3.02GB of disk space for this.

Filenames in this dataset look like this: 8863_idx5_x451_y1451_class0. Here, 8863_idx5 is the patient ID, 451 and 1451 are the x- and y-coordinates of the crop, and 0 is the class label (0 denotes absence of IDC); a small sketch that parses this filename format follows the setup steps below.

Prerequisites

You'll need to install some Python packages to be able to run this advanced Python project. You can do this with pip:

pip install numpy opencv-python pillow tensorflow keras imutils scikit-learn matplotlib

Steps for Advanced Project in Python – Breast Cancer Classification

1. Download this zip. Unzip it at your preferred location and navigate to it.

2. Now, inside the inner breast-cancer-classification directory, create a directory called datasets, and inside it a directory called original:

mkdir datasets
mkdir datasets\original

3. Download the dataset.

4. Unzip the dataset in the original directory. To observe the structure of this directory, we'll use the tree command:

cd breast-cancer-classification\breast-cancer-classification\datasets\original

We have a directory for each patient ID, and in each such directory, we have the 0 and 1 directories for images with benign and malignant content respectively.
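As promised above, the patch filename format can be unpacked programmatically. This is a minimal sketch; parse_patch_name is a hypothetical helper written for illustration, not part of the tutorial's scripts.

# Hypothetical helper that unpacks the IDC patch filename format described above
def parse_patch_name(name):
    # e.g. "8863_idx5_x451_y1451_class0"
    parts = name.split("_")
    patient_id = "_".join(parts[:2])            # "8863_idx5"
    x = int(parts[2][1:])                       # drop the leading "x"
    y = int(parts[3][1:])                       # drop the leading "y"
    label = int(parts[4].replace("class", ""))  # 0 = no IDC, 1 = IDC
    return patient_id, x, y, label

print(parse_patch_name("8863_idx5_x451_y1451_class0"))  # ('8863_idx5', 451, 1451, 0)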
config.py: This holds some configuration we'll need for building the dataset and training the model. You'll find this in the cancernet directory.

import os

INPUT_DATASET = "datasets/original"

BASE_PATH = "datasets/idc"
TRAIN_PATH = os.path.sep.join([BASE_PATH, "training"])
VAL_PATH = os.path.sep.join([BASE_PATH, "validation"])
TEST_PATH = os.path.sep.join([BASE_PATH, "testing"])

TRAIN_SPLIT = 0.8
VAL_SPLIT = 0.1

Screenshot:

Here, we declare the path to the input dataset (datasets/original), the path to the new directory (datasets/idc), and the paths for the training, validation, and testing directories using the base path. We also declare that 80% of the entire dataset will be used for training, and of that, 10% will be used for validation.

build_dataset.py: This will split our dataset into training, validation, and testing sets in the ratio mentioned above: 80% for training (of that, 10% for validation) and 20% for testing. With the ImageDataGenerator from Keras, we will extract batches of images to avoid making space for the entire dataset in memory at once.

from cancernet import config
from imutils import paths
import random, shutil, os

originalPaths = list(paths.list_images(config.INPUT_DATASET))
random.seed(7)
random.shuffle(originalPaths)

index = int(len(originalPaths) * config.TRAIN_SPLIT)
trainPaths = originalPaths[:index]
testPaths = originalPaths[index:]

index = int(len(trainPaths) * config.VAL_SPLIT)
valPaths = trainPaths[:index]
trainPaths = trainPaths[index:]

datasets = [("training", trainPaths, config.TRAIN_PATH),
            ("validation", valPaths, config.VAL_PATH),
            ("testing", testPaths, config.TEST_PATH)]

for (setType, originalPaths, basePath) in datasets:
    print(f'Building {setType} set')

    if not os.path.exists(basePath):
        print(f'Building directory {basePath}')
        os.makedirs(basePath)

    for path in originalPaths:
        file = path.split(os.path.sep)[-1]
        label = file[-5:-4]

        labelPath = os.path.sep.join([basePath, label])
        if not os.path.exists(labelPath):
            print(f'Building directory {labelPath}')
            os.makedirs(labelPath)

        newPath = os.path.sep.join([labelPath, file])
        shutil.copy2(path, newPath)

Screenshot:

In this, we'll import from config, imutils, random, shutil, and os. We build a list of original paths to the images, then shuffle the list. Then, we calculate an index by multiplying the length of this list by 0.8 so we can slice it to get sublists for the training and testing datasets. Next, we calculate a further index, saving 10% of the training list for validation and keeping the rest for training itself.

Now, datasets is a list of tuples holding information about the training, validation, and testing sets: the image paths and the base path for each. For each setType, list of paths, and base path in this list, we print, say, 'Building testing set'. If the base path does not exist, we create the directory. And for each path in originalPaths, we extract the filename and the class label. We build the path to the label directory (0 or 1); if it doesn't exist yet, we explicitly create this directory. Finally, we build the path to the resulting image and copy the image there, where it belongs.

5. Run the script build_dataset.py:

py build_dataset.py

Output Screenshot:
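After build_dataset.py has run, it can help to sanity-check that the three splits contain the expected number of images. This is a minimal sketch, assuming the cancernet package (and its config module) from the downloaded zip is importable from the current directory:

# Quick sanity check of the split sizes (illustrative, not part of the tutorial scripts)
from imutils import paths
from cancernet import config

for name, path in [("training", config.TRAIN_PATH),
                   ("validation", config.VAL_PATH),
                   ("testing", config.TEST_PATH)]:
    print(name, len(list(paths.list_images(path))))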
cancernet.py: The network we'll build is a CNN (Convolutional Neural Network), and we'll call it CancerNet. This network performs the following operations:

Use 3×3 CONV filters
Stack these filters on top of each other
Perform max-pooling
Use depthwise separable convolution (more efficient, takes up less memory)

from keras.models import Sequential
from keras.layers.normalization import BatchNormalization
from keras.layers.convolutional import SeparableConv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.core import Activation
from keras.layers.core import Flatten
from keras.layers.core import Dropout
from keras.layers.core import Dense
from keras import backend as K

class CancerNet:
    @staticmethod
    def build(width, height, depth, classes):
        model = Sequential()
        shape = (height, width, depth)
        channelDim = -1

        if K.image_data_format() == "channels_first":
            shape = (depth, height, width)
            channelDim = 1

        model.add(SeparableConv2D(32, (3,3), padding="same", input_shape=shape))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=channelDim))
        model.add(MaxPooling2D(pool_size=(2,2)))
        model.add(Dropout(0.25))

        model.add(SeparableConv2D(64, (3,3), padding="same"))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=channelDim))
        model.add(SeparableConv2D(64, (3,3), padding="same"))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=channelDim))
        model.add(MaxPooling2D(pool_size=(2,2)))
        model.add(Dropout(0.25))

        model.add(SeparableConv2D(128, (3,3), padding="same"))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=channelDim))
        model.add(SeparableConv2D(128, (3,3), padding="same"))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=channelDim))
        model.add(SeparableConv2D(128, (3,3), padding="same"))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=channelDim))
        model.add(MaxPooling2D(pool_size=(2,2)))
        model.add(Dropout(0.25))

        model.add(Flatten())
        model.add(Dense(256))
        model.add(Activation("relu"))
        model.add(BatchNormalization())
        model.add(Dropout(0.5))

        model.add(Dense(classes))
        model.add(Activation("softmax"))

        return model

We use the Sequential API to build CancerNet and SeparableConv2D to implement depthwise separable convolutions. The class CancerNet has a static method build that takes four parameters: the width and height of the image, its depth (the number of color channels in each image), and the number of classes the network will predict between, which, for us, is 2 (0 and 1). In this method, we initialize model and shape. When using channels_first, we update the shape and the channel dimension. We then define three DEPTHWISE_CONV => RELU => POOL blocks, each with deeper stacking and a greater number of filters. The softmax classifier outputs prediction percentages for each class. In the end, we return the model.
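Before training, you can also inspect the architecture defined above. A minimal sketch, again assuming the cancernet package is importable:

# Quick architecture inspection (illustrative, not part of the tutorial scripts)
from cancernet.cancernet import CancerNet

model = CancerNet.build(width=48, height=48, depth=3, classes=2)
model.summary()  # prints each layer with its output shape and parameter count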
train_model.py: This trains and evaluates our model. Here, we'll import from keras, sklearn, cancernet, config, imutils, matplotlib, numpy, and os.

import matplotlib
matplotlib.use("Agg")

from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import LearningRateScheduler
from keras.optimizers import Adagrad
from keras.utils import np_utils
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from cancernet.cancernet import CancerNet
from cancernet import config
from imutils import paths
import matplotlib.pyplot as plt
import numpy as np
import os

NUM_EPOCHS = 40; INIT_LR = 1e-2; BS = 32

trainPaths = list(paths.list_images(config.TRAIN_PATH))
lenTrain = len(trainPaths)
lenVal = len(list(paths.list_images(config.VAL_PATH)))
lenTest = len(list(paths.list_images(config.TEST_PATH)))

trainLabels = [int(p.split(os.path.sep)[-2]) for p in trainPaths]
trainLabels = np_utils.to_categorical(trainLabels)
classTotals = trainLabels.sum(axis=0)
classWeight = classTotals.max() / classTotals

trainAug = ImageDataGenerator(
    rescale=1 / 255.0,
    rotation_range=20,
    zoom_range=0.05,
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=0.05,
    horizontal_flip=True,
    vertical_flip=True,
    fill_mode="nearest")

valAug = ImageDataGenerator(rescale=1 / 255.0)

trainGen = trainAug.flow_from_directory(
    config.TRAIN_PATH,
    class_mode="categorical",
    target_size=(48,48),
    color_mode="rgb",
    shuffle=True,
    batch_size=BS)
valGen = valAug.flow_from_directory(
    config.VAL_PATH,
    class_mode="categorical",
    target_size=(48,48),
    color_mode="rgb",
    shuffle=False,
    batch_size=BS)
testGen = valAug.flow_from_directory(
    config.TEST_PATH,
    class_mode="categorical",
    target_size=(48,48),
    color_mode="rgb",
    shuffle=False,
    batch_size=BS)

model = CancerNet.build(width=48, height=48, depth=3, classes=2)
opt = Adagrad(lr=INIT_LR, decay=INIT_LR / NUM_EPOCHS)
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])

M = model.fit_generator(
    trainGen,
    steps_per_epoch=lenTrain // BS,
    validation_data=valGen,
    validation_steps=lenVal // BS,
    class_weight=classWeight,
    epochs=NUM_EPOCHS)

print("Now evaluating the model")
testGen.reset()
pred_indices = model.predict_generator(testGen, steps=(lenTest // BS) + 1)
pred_indices = np.argmax(pred_indices, axis=1)

print(classification_report(testGen.classes, pred_indices, target_names=testGen.class_indices.keys()))

cm = confusion_matrix(testGen.classes, pred_indices)
total = sum(sum(cm))
accuracy = (cm[0,0] + cm[1,1]) / total
specificity = cm[1,1] / (cm[1,0] + cm[1,1])
sensitivity = cm[0,0] / (cm[0,0] + cm[0,1])
print(cm)
print(f'Accuracy: {accuracy}')
print(f'Specificity: {specificity}')
print(f'Sensitivity: {sensitivity}')

N = NUM_EPOCHS
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0,N), M.history["loss"], label="train_loss")
plt.plot(np.arange(0,N), M.history["val_loss"], label="val_loss")
plt.plot(np.arange(0,N), M.history["acc"], label="train_acc")
plt.plot(np.arange(0,N), M.history["val_acc"], label="val_acc")
plt.title("Training Loss and Accuracy on the IDC Dataset")
plt.xlabel("Epoch No.")
plt.ylabel("Loss/Accuracy")
plt.legend(loc="lower left")
plt.savefig('plot.png')

In this script, we first set initial values for the number of epochs, the learning rate, and the batch size. We get the number of image paths in the three directories for training, validation, and testing. Then, we compute the class weight for the training data so we can deal with the class imbalance. Next, we initialize the training data augmentation object. Augmentation is a form of regularization that helps the model generalize: we slightly modify the training examples to avoid the need for more training data.
We'll initialize the validation and testing data augmentation objects, and then the training, validation, and testing generators so they can generate batches of images of size batch_size. Then, we initialize the model with the Adagrad optimizer and compile it with the binary_crossentropy loss function. To fit the model, we make a call to fit_generator(). We have successfully trained our model.

Now, let's evaluate the model on our testing data. We reset the generator and make predictions on the data. Then, for the images from the testing set, we get the indices of the labels with the largest predicted probability and display a classification report. Next, we compute the confusion matrix, derive the raw accuracy, specificity, and sensitivity from it, and display all of these values. Finally, we plot the training loss and accuracy.

Summary

In this Python project, we learned to build a breast cancer classifier on the IDC dataset (histology images for Invasive Ductal Carcinoma) and created the network CancerNet for it, implemented in Keras. Hope you enjoyed this Python project.

DATA SCIENCE, sounds interesting......

What is data science?

A groundbreaking 2013 study found that 90% of all the data in the world had been generated in the preceding two years. Let that sink in: in just two years we collected and processed roughly nine times more information than in all of humanity's previous 92,000 years combined. And it isn't slowing down. It is estimated that we have already created 2.7 zettabytes of data, and that by 2020 this number will grow to a staggering 44 zettabytes.

What do we do with all this data? How do we make it useful to us? What are the real applications? These questions are the domain of data science. Almost any company will say it is doing some kind of data science, but what exactly does that mean? The field is growing so fast and revolutionizing so many industries that it is difficult to capture its scope in a formal definition, but in general, data science is dedicated to turning raw data into clean, actionable insights.

Commonly called "the oil of the 21st century", our digital data is of the utmost importance in this field. It offers invaluable advantages in business, research, and our everyday lives. Your commute to work, your last Google search for a nearby cafe, your Instagram post about what you ate, and even your fitness tracker's health data are all important to different data scientists in different ways. Data science examines vast oceans of data, looks for connections and patterns, and is responsible for bringing us new products, delivering innovative insights, and making our lives more convenient.

How does data science work?

Data science spans a variety of disciplines and subject areas to provide a refined, comprehensive, and holistic view of raw data. Data scientists must be trained in data engineering, mathematics, statistics, advanced computing, and visualization in order to examine messy volumes of information effectively and communicate only the most important pieces that drive innovation and efficiency. Data scientists also rely heavily on artificial intelligence, particularly its machine learning and deep learning subfields, to build models and make predictions using algorithms and other techniques.

Data science generally has a five-stage life cycle:

Capture: data acquisition, data entry, signal reception, data extraction
Maintenance: data storage, data cleaning, data processing, data architecture
Process: data mining, clustering, data modelling, data summarization
Communication: data reporting, data visualization, business intelligence, decision making
Analysis: exploratory/confirmatory analysis, predictive analysis, regression, text mining, qualitative analysis

FOLLOW 21ST AVENUE FOR MORE

DRUG DISCOVERY WITH THE CHEMBL WEB RESOURCE - ORAL MOLECULES FOR SARS

This is a small drug discovery bioinformatics project. We are searching for molecules that could be taken orally by a human to treat SARS. It is my first complete project mining protein bioactivity data from ChEMBL; get the code on GitHub.

pip install chembl_webresource_client
%matplotlib inline
import matplotlib.pyplot as plt
import sys
import os
sys.path.append('/usr/local/lib/python3.7/site-packages/') ### This is needed to make sure we can import RDKit in Google Colab.
import pandas as pd
from chembl_webresource_client.new_client import new_client
target = new_client.target
target_query = target.search('coronavirus')
targets = pd.DataFrame.from_dict(target_query)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
targets
selected_target = targets.target_chembl_id[4]
selected_target
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target)
df = pd.DataFrame.from_dict(res)
Narrowing down to IC50
activity2 = new_client.activity
res2 = activity2.filter(target_chembl_id=selected_target).filter(type='IC50')
df2 = pd.DataFrame.from_dict(res2)
df2.head(10)
Let's save a CSV at this point and then reload the new CSV.
df2.to_csv('IMP13_bioactivity_data.csv', index=False)
df3 = df2
### df3 = pd.read_csv(r'C:\Users\PheXeRiaN\Desktop\Drug Discovery\IMP13_bioactivity_data.csv') ### Use this to load the saved CSV from disk when not working in Google Colab.
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
df3.head(5)
df4 = df3[df3.value.notna()]
df4.head(5)
bioactivity_class = []
for x in df4.standard_value:
    if float(x) >= 10000:
        bioactivity_class.append('inactive')
    elif float(x) <= 1000:
        bioactivity_class.append('active')
    else:
        bioactivity_class.append('intermediate')
Now we need to remove duplicates for a few of the rows and move the relevant features into a new dataframe; a more compact pandas version of this step is sketched just after the CSV is saved below.
molecule_chembl_id = []
for x in df4.molecule_chembl_id:
molecule_chembl_id.append(x)
canonical_smiles = []
for x in df4.canonical_smiles:
canonical_smiles.append(x)

standard_value = []
for x in df4.standard_value:
standard_value.append(x)
data_tuples = list(zip(molecule_chembl_id, canonical_smiles, bioactivity_class, standard_value))

df5 = pd.DataFrame(data_tuples, columns=['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value'])
df5.head(5)
df5.to_csv('bioactivity_preprocessed_data.csv', index=False)
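As an aside, the same preprocessing, including the duplicate removal mentioned above, can be done more compactly in pandas. This is a minimal sketch, assuming df4 and bioactivity_class from the cells above; df5_alt is a name introduced here for illustration:

### Compact pandas sketch of the preprocessing step (illustrative only)
df5_alt = df4[['molecule_chembl_id', 'canonical_smiles', 'standard_value']].copy()
df5_alt['bioactivity_class'] = bioactivity_class
df5_alt = df5_alt.drop_duplicates(subset='molecule_chembl_id')   ### keep one row per molecule
df5_alt = df5_alt[['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value']]
df5_alt.to_csv('bioactivity_preprocessed_data_alt.csv', index=False)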

Calculate the Lipinski descriptors used in the Rule of Five, which screens for a favourable ADME (absorption, distribution, metabolism, excretion) pharmacokinetic profile. A sketch that checks the rule itself follows the descriptor preview below.

! wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local
! conda install -c rdkit rdkit -y
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def lipinski(smiles, verbose=False):

    moldata = []
    for elem in smiles:
        mol = Chem.MolFromSmiles(elem)
        moldata.append(mol)

    baseData = np.arange(1, 1)
    i = 0
    for mol in moldata:

        desc_MolWt = Descriptors.MolWt(mol)
        desc_MolLogP = Descriptors.MolLogP(mol)
        desc_NumHDonors = Lipinski.NumHDonors(mol)
        desc_NumHAcceptors = Lipinski.NumHAcceptors(mol)

        row = np.array([desc_MolWt,
                        desc_MolLogP,
                        desc_NumHDonors,
                        desc_NumHAcceptors])

        if(i == 0):
            baseData = row
        else:
            baseData = np.vstack([baseData, row])
        i = i + 1

    columnNames = ["MW", "LogP", "NumHDonors", "NumHAcceptors"]
    descriptors = pd.DataFrame(data=baseData, columns=columnNames)

    return descriptors

### Made from code from https://codeocean.com/capsule/8848590/tree/v1

df_lipinski = lipinski(df5.canonical_smiles)   ### use df5 here so the rows line up with the preprocessed data we concatenate below

df_lipinski.head(10)
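Since the Rule of Five motivated these descriptors, here is a small hedged sketch that counts how many of Lipinski's criteria each molecule violates (MW <= 500, LogP <= 5, H-bond donors <= 5, H-bond acceptors <= 10); ro5_violations is a name introduced for illustration:

### Count Rule of Five violations per molecule (assumes df_lipinski from above)
ro5_violations = (
    (df_lipinski['MW'] > 500).astype(int)
    + (df_lipinski['LogP'] > 5).astype(int)
    + (df_lipinski['NumHDonors'] > 5).astype(int)
    + (df_lipinski['NumHAcceptors'] > 10).astype(int)
)
print(ro5_violations.value_counts())   ### 0 or 1 violations is generally taken as orally bioavailable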

Combining dataframes

df6 = pd.concat([df5, df_lipinski], axis=1)

df6.head(10)

import seaborn as sns

plt.figure(figsize=(8,5))
sns.distplot(df6.standard_value)

Now we need to convert IC50 to the negative log scale, pIC50 = -log10(IC50 in molar), which gives us a more uniform distribution for prediction.
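As a quick worked check of the conversion (illustrative, not part of the notebook): an IC50 of 1000 nM is 1000 x 10^-9 M = 10^-6 M, so pIC50 = -log10(10^-6) = 6, which sits right at the 'active' cut-off of 1000 nM used earlier.

### Worked check of the pIC50 conversion (illustrative only)
import numpy as np
print(-np.log10(1000 * 10**-9))   ### 6.0, i.e. an IC50 of 1000 nM corresponds to pIC50 = 6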

def pIC50(input):
    pIC50 = []

    for i in input['standard_value_norm']:
        molar = i * (10**-9)   # Converts nanomolar to molar
        pIC50.append(-np.log10(molar))

    input['pIC50'] = pIC50
    x = input.drop('standard_value_norm', 1)

    return x

We will cap the standard values at 100,000,000 nM to make things easier for us. This prevents negative pIC50 values: a capped value of 10^8 nM is 0.1 M, so the smallest possible pIC50 becomes -log10(0.1) = 1, whereas uncapped values above 10^9 nM would produce negative pIC50s.

df6.standard_value.dropna(inplace=True)

df6.standard_value = pd.to_numeric(df6.standard_value).astype(float)

def norm_value(input):
    norm = []

    for i in input['standard_value']:
        if i > 100000000:
            i = 100000000
        norm.append(i)

    input['standard_value_norm'] = norm
    x = input.drop('standard_value', 1)

    return x

df_normalized = norm_value(df6)
df_normalized.head(10)

df7 = pIC50(df_normalized)
df7.head(10)

plt.figure(figsize=(8,5))
sns.distplot(df7.pIC50)

df7.pIC50.describe()

count    105.000000
mean       4.876188
std        0.919190
min        3.000000
25%        4.346787
50%        4.823909
75%        5.055517
max        7.301030
Name: pIC50, dtype: float64

Removing intermediate class

df_2class = df7[df7.bioactivity_class != 'intermediate']

Looking at Active vs Inactive molecules

sns.set(style='darkgrid')
sns.countplot(x='bioactivity_class', data = df_2class, edgecolor='black')

plt.xlabel('Bioactivity class', fontsize = 17)
plt.ylabel('Frequency', fontsize = 17)
plt.title('Inactive vs Active Bar Graph', size = 20, fontweight = 'bold')

plt.savefig('plot_bioactivity_class.png')

sns.scatterplot(x='MW', y='LogP', data=df_2class, hue='bioactivity_class', size='pIC50', edgecolor='black', alpha=0.7)

plt.xlabel('MW', fontsize=14)
plt.ylabel('LogP', fontsize=14)
plt.title('MW to LogP', size = 20, fontweight = 'bold')
plt.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0, fancybox = True)

plt.savefig('plot_MW_vs_LogP.png')

sns.boxplot(x = 'bioactivity_class', y = 'pIC50', data = df_2class)

plt.xlabel('Bioactivity class', fontsize=14)
plt.ylabel('pIC50 value', fontsize=14)
plt.title('Boxplot of Inactive to Active', size = 20, fontweight = 'bold')

plt.savefig('plot_ic50.png')

Now a Mann-Whitney U test to check whether the two groups are statistically different. This Mann-Whitney test compares the active and inactive bioactivity classes.

def mannwhitney(descriptor, verbose=True):
    from numpy.random import seed
    from numpy.random import randn
    from scipy.stats import mannwhitneyu

    seed(42)

    ### actives and inactives
    selection = [descriptor, 'bioactivity_class']
    df = df_2class[selection]
    active = df[df.bioactivity_class == 'active']
    active = active[descriptor]

    selection = [descriptor, 'bioactivity_class']
    df = df_2class[selection]
    inactive = df[df.bioactivity_class == 'inactive']
    inactive = inactive[descriptor]

    ### comparing samples
    stat, p = mannwhitneyu(active, inactive)
    print('Statistics=%.3f, p=%.3f' % (stat, p))

    ### interpret
    alpha = 0.05
    if p > alpha:
        interpretation = 'Same distribution (fail to reject H0)'
    else:
        interpretation = 'Different distribution (reject H0)'

    results = pd.DataFrame({'Descriptor': descriptor,
                            'Statistics': stat,
                            'p': p,
                            'alpha': alpha,
                            'Interpretation': interpretation}, index=[0])
    #filename = 'mannwhitneyu_' + descriptor + '.csv'
    #results.to_csv(filename)

    return results

mannwhitney('pIC50')

Molecular Weight

sns.boxplot(x = 'bioactivity_class', y = 'MW', data = df_2class)

plt.xlabel('Bioactivity class', fontsize=14)
plt.ylabel('MW', fontsize=14)
plt.title('MW to Bioactivity Class', size = 20, fontweight = 'bold')

plt.savefig('plot_MW.png')

mannwhitney('MW')

LogP

sns.boxplot(x = 'bioactivity_class', y = 'LogP', data = df_2class)

plt.xlabel('Bioactivity class', fontsize=14)
plt.ylabel('LogP', fontsize=14)
plt.title('LogP vs Bioactivity Class', size = 20, fontweight='bold')

plt.savefig('plot_LogP.png')

mannwhitney('LogP')

Number of Hydrogen Donors

sns.boxplot(x = 'bioactivity_class', y = 'NumHDonors', data = df_2class)

plt.xlabel('Bioactivity class', fontsize=14)
plt.ylabel('NumHDonors', fontsize=14)
plt.title('Hydrogen Donors vs Bioactivity Class', size = 20, fontweight='bold')

plt.savefig('plot_NumHDonors.png')

mannwhitney('NumHDonors')

Number of Hydrogen Acceptors

sns.boxplot(x = 'bioactivity_class', y = 'NumHAcceptors', data = df_2class)

plt.xlabel('Bioactivity class', fontsize=14)
plt.ylabel('NumHAcceptors', fontsize=14)
plt.title('Hydrogen Acceptors vs Bioactivity Class', size = 20, fontweight='bold')

plt.savefig('plot_NumHacceptors.png')

mannwhitney('NumHAcceptors')

Interpretations
pIC50 ------------ Statistically significant (This is expected as we pre-processed this data to split active and inactive)
MW -------------- Not significant
LogP ------------ Not significant
H Donors ------ Not significant
H Acceptors -- Not significant
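For the record, the interpretation table above can be reproduced in a single cell by looping the mannwhitney helper over every descriptor. A minimal sketch, assuming mannwhitney and df_2class are defined as above; mw_results is a name introduced for illustration:

### Run the Mann-Whitney U test for every descriptor in one pass
descriptors = ['pIC50', 'MW', 'LogP', 'NumHDonors', 'NumHAcceptors']
mw_results = pd.concat([mannwhitney(d) for d in descriptors], ignore_index=True)
mw_results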

df_2class.to_csv('df_2class.csv', index=False)

We will now use PaDEL to calculate the molecular descriptors.

! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip

df_2class = df_2class.dropna()   ### reassign instead of dropping in place, to avoid pandas' SettingWithCopyWarning

selection = ['canonical_smiles','molecule_chembl_id']
df7_selection = df7[selection]   ### use the 105-row preprocessed frame (df7) so the SMILES file lines up with df_y below
df7_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

! cat molecule.smi | head -5

Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21 CHEMBL187579
O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21 CHEMBL188487
O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21 CHEMBL185698
O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21 CHEMBL426082
O=C1C(=O)N(Cc2cc3ccccc3s2)c2c1cccc2[N+](=O)[O-] CHEMBL187717

! cat molecule.smi | wc -l ### Count how many molecules we have to work with.

105

Calculating descriptors

! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv

The line above shows what is inside padel.sh. It runs PaDEL-Descriptor in Java with 1 GB of memory, removes salts (for example sodium and chloride counter-ions) from the chemical structures, standardizes nitro groups, and computes PubChem fingerprints as the molecular descriptors.

! bash padel.sh

Preparing the X and Y Data Matrices
X Matrix

df_X_matrix = pd.read_csv('descriptors_output.csv')

df_X_matrix

df_X_matrix = df_X_matrix.drop(columns=['Name'])
df_X_matrix

Y variable

The IC50 values were already converted to pIC50 earlier; here we simply select the pIC50 column as our y variable.

df_y = df7['pIC50']
df_y.dropna(inplace=True)

df_y

0      5.142668
1      5.026872
2      4.869666
3      4.882397
4      5.698970
         ...
100    4.974694
101    4.995679
102    4.939302
103    4.970616
104    4.102923
Name: pIC50, Length: 105, dtype: float64

Combining

final_dataset = pd.concat([df_X_matrix,df_y], axis=1)

missdata = final_dataset.isnull().values.any()   ### check the combined dataset, not the raw activity frame
print(missdata)

True

if missdata == True:
    print('# of missing values:', final_dataset.isnull().values.sum())
else:
    print('No missing data')

# of missing values: 0

final_dataset.isnull().sum()

final_dataset = final_dataset.dropna(axis=0) ### Removing one final NaN value found in matrix.

final_dataset

final_dataset.to_csv('final_dataset.csv', index=False)

Modeling

final_dataset = pd.read_csv('final_dataset.csv')

final_dataset

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

X = final_dataset.drop('pIC50', axis=1).copy()
y = final_dataset.pIC50.copy()

X.shape

(105, 881)

y.shape

(105,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)
R_square = model.score(X_test, y_test)
R_square

0.532986696734584

As a rough rule of thumb, an R-squared of about 0.5 to 0.6 on held-out data is workable for a first QSAR model of this kind, so this model is usable as a starting point.

y_pred = model.predict(X_test)

Scatterplot of Predicted vs Experimental pIC50

ax = sns.regplot(x=y_test, y=y_pred, scatter_kws = {'alpha':0.6})
ax.set_xlabel('Experimental pIC50', fontsize = 'large')
ax.set_ylabel('Predicted pIC50', fontsize='large')
plt.title('Predicted vs Experimental pIC50', size = 18, fontweight = 'bold')
ax.figure.set_size_inches(5, 5)
plt.show()

import plotly.express as px

fig = px.scatter(x=y_test, y=y_pred, color=y_test,
                 labels={'x': 'Experimental pIC50', 'y': 'Predicted pIC50', 'color': 'pIC50'})

fig.update_layout(
    height=600,
    width=600,
    title_text='Predicted vs Experimental pIC50')

fig.show()

A good molecule to investigate would have a pIC50 near 5.5 or greater; this would indicate stronger inhibition of the target protein it binds on the SARS virus.
This is the end. As you can see, we can predict molecules with a pIC50 greater than 5.5, although a better goal to strive for is 6. Our model also holds up reasonably, with an R-squared value greater than 0.5.

FAKE NEWS DETECTOR

Detecting Fake News with Python and Machine Learning

Do you trust all the news you hear from social media?
Not all news is real, right?
How will you detect fake news?
The answer is Python. By practicing this advanced Python project of detecting fake news, you will easily be able to tell the difference between real and fake news.
Before moving ahead in this machine learning project, familiarize yourself with the terms related to it, like fake news, the TfidfVectorizer, and the PassiveAggressiveClassifier.
I would also like to add that The 21st Century/21st Avenue has published a series of machine learning projects where you will find interesting, open-source, advanced ML projects. Do check them out, and then share your experience in the comments.
What is Fake News?
A type of yellow journalism, fake news encapsulates pieces of news that may be hoaxes and is generally spread through social media and other online media. This is often done to further or impose certain ideas, frequently in service of political agendas. Such news items may contain false and/or exaggerated claims, may be amplified by algorithms until they go viral, and can leave users trapped in a filter bubble.
What is a TfidfVectorizer?
TF (Term Frequency): The number of times a word appears in a document is its Term Frequency. A higher value means a term appears more often than others, and so, the document is a good match when the term is part of the search terms.
IDF (Inverse Document Frequency): Words that occur many times in a document, but also occur many times in many other documents, may be irrelevant. IDF is a measure of how significant a term is in the entire corpus.
The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.
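To make this concrete, here is a tiny hedged example (the documents are made up, not from the news dataset) showing how a TfidfVectorizer turns raw text into a TF-IDF matrix:

# Toy corpus, purely illustrative
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the economy is growing",
        "the election was rigged",
        "the economy shrank after the election"]

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # terms kept after stop-word and max_df filtering
print(tfidf.toarray())                 # one row per document, one column per term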
What is a PassiveAggressiveClassifier?
Passive Aggressive algorithms are online learning algorithms. Such an algorithm remains passive for a correct classification outcome, and turns aggressive in the event of a misclassification, updating and adjusting its weights. Unlike most other algorithms, it does not converge. Its purpose is to make updates that correct the loss while causing very little change in the norm of the weight vector.
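The update-on-mistake behaviour can be seen in a tiny hedged sketch (toy 2D points, not the news dataset) that feeds examples to a PassiveAggressiveClassifier one batch at a time:

# Toy online-learning example, purely illustrative
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

X = np.array([[1.0, 1.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, 0, 0])  # class 1 roughly where x + y > 0

clf = PassiveAggressiveClassifier(max_iter=50, random_state=7)
clf.partial_fit(X[:2], y[:2], classes=[0, 1])  # first online update
clf.partial_fit(X[2:], y[2:])                  # weights move only as much as needed to fix mistakes

print(clf.predict([[3.0, 2.0], [-3.0, -2.0]]))  # expected: [1 0]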
Detecting Fake News with Python
To build a model to accurately classify a piece of news as REAL or FAKE.

About Detecting Fake News with Python
This advanced python project of detecting fake news deals with fake and real news. Using sklearn, we build a TfidfVectorizer on our dataset. Then, we initialize a PassiveAggressive Classifier and fit the model. In the end, the accuracy score and the confusion matrix tell us how well our model fares.
The fake news Dataset
The dataset we'll use for this Python project is news.csv. It has a shape of 7796×4. The first column identifies the news item, the second and third are the title and text, and the fourth column has labels denoting whether the item is REAL or FAKE. The dataset takes up 29.2MB of space and you can download it here.
Project Prerequisites
You’ll need to install the following libraries with pip:
pip install numpy pandas scikit-learn
You’ll need to install Jupyter Lab to run your code. Get to your command prompt and run the following command:
C:\Users\DataFlair>jupyter lab
You’ll see a new browser window open up; create a new console and use it to run your code. To run multiple lines of code at once, press Shift+Enter.
Steps for detecting fake news with Python
Follow the steps below for detecting fake news and complete your first advanced Python project.

1. Make the necessary imports:

import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

2. Now, let’s read the data into a DataFrame, and get the shape of the data and the first 5 records.

#Read the data
df=pd.read_csv('D:\\DataFlair\\news.csv')

#Get shape and head
df.shape
df.head()
Output Screenshot:

3. And get the labels from the DataFrame.

# Get the labels
labels=df.label
labels.head()
Output Screenshot:

4. Split the dataset into training and testing sets.
# Split the dataset
x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)
Screenshot:

5. Let’s initialize a TfidfVectorizer with stop words from the English language and a maximum document frequency of 0.7 (terms with a higher document frequency will be discarded). Stop words are the most common words in a language that are to be filtered out before processing the natural language data. And a TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features.

Now, fit and transform the vectorizer on the train set, and transform the vectorizer on the test set.
# Initialize a TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

# Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train)
tfidf_test=tfidf_vectorizer.transform(x_test)
6. Next, we'll initialize a PassiveAggressiveClassifier and fit it on tfidf_train and y_train.

Then, we’ll predict on the test set from the TfidfVectorizer and calculate the accuracy with accuracy_score() from sklearn.metrics.
#- Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)

# Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')
Output Screenshot:

7. We got an accuracy of 92.82% with this model. Finally, let’s print out a confusion matrix to gain insight into the number of false and true negatives and positives.
#DataFlair - Build confusion matrix
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

Output Screenshot:

So with this model, we have 589 true positives, 587 true negatives, 42 false positives, and 49 false negatives.
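As a quick cross-check (a worked example, not part of the tutorial code), the reported accuracy follows directly from these counts:

# Accuracy recomputed from the confusion matrix counts above
tp, tn, fp, fn = 589, 587, 42, 49
print((tp + tn) / (tp + tn + fp + fn))  # ~0.9282, i.e. 92.82%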
Summary
Today, we learned to detect fake news with Python. We took a political dataset, implemented a TfidfVectorizer, initialized a PassiveAggressiveClassifier, and fit our model. We ended up obtaining an accuracy of 92.82%.
Hope you enjoyed the fake news detection Python project.

FOLLOW 21ST AVENUE FOR MORE