Recent Posts

September, 2020

  • 20 September

    Clustering Wikipedia

    Hi, in this article I’ll build a simple clustering example using Wikipedia. You can access the full code here: https://drive.google.com/drive/folders/1FKAqwAvaSmEt0jzL3lHu5qQGEcw4FQGS?usp=sharing

    # Perform the necessary imports
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans
    from sklearn.pipeline import make_pipeline

    # Create a TruncatedSVD instance: svd
    svd = TruncatedSVD(n_components=50)

    # Create a KMeans instance: …

    Read More »
  • 18 September

    Introduction of DATA WAREHOUSE-What is DATA WAREHOUSE?

    What is the Data Warehouse? A data warehouse is a repository built for querying and analyzing related data. It was created so that analytical workloads do not burden the operational database. A data warehouse is itself a database. The data warehouse is set up in order …

    Read More »
  • 18 September

    Python Unsupervised Learning -5

    Hello, in this article we continue with the topic of unsupervised learning. Dimension reduction: Dimension reduction finds patterns in data and uses these patterns to re-express it in a compressed form. This makes subsequent computation with the data much more efficient, which can be a big deal in a world of …

    Read More »
  • 13 September

    Python Unsupervised Learning -4

    I will work through a short t-SNE example in this article.

    t-SNE visualization of grain dataset

    from sklearn.manifold import TSNE
    import pandas as pd
    import numpy

    samples = [[15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22],
               [14.88, 14.57, 0.8811, 5.554, 3.333, …

    Read More »
  • 13 September

    Introduction of DATA WAREHOUSE-What is DATA?

    What is Data? This very popular word refers to every letter, number, or date entered into the computers we use and the applications that run on them. Everything we encounter in daily life actually contains data. For example: your tc, phone …

    Read More »
  • 13 September

    Oracle XE Installation on Hortonworks Data Flow (HDF)

    Hi, in this article I will show you how to install Oracle Express Edition (XE) on HDF (Hortonworks DataFlow). First of all, I assume that the HDF platform is installed in your virtual machine (Oracle VM or VMware). Connect to the virtual machine with ssh from the web browser or …

    Read More »
  • 9 September

    Apache Nifi on Google Cloud

    Hello, in this article I will explain how to install Apache NiFi on Google Cloud. First, you have to create a Google Cloud account. Assuming you have done this step, you need to create a virtual machine. Click create new instance. I recommend using Ubuntu 18.04 …

    Read More »
  • 8 September

    Introduction to gensim (Python)

    What is gensim? A popular open-source NLP library that uses top academic models to perform complex tasks: building document or word vectors, and performing topic identification and document comparison. A word embedding or vector is trained from a larger corpus and is a multi-dimensional representation of a word or document. For example, in …

    Read More »
  • 7 September

    Introduction to Natural Language Processing in Python – (Simple text preprocessing)

    Why preprocess? It helps make for better input data when performing machine learning or other statistical methods. Examples: tokenization to create a bag of words; lowercasing words; lemmatization/stemming (shortening words to their root stems); removing stop words, punctuation, or unwanted tokens. It is good to experiment with different approaches. Text preprocessing …

    Read More »
  • 7 September

    Introduction to Natural Language Processing in Python – (Words counts with bag-of-words )

    Bag-of-words: Bag of words is a very simple and basic method for finding topics in a text. For bag of words, you first create tokens using tokenization, and then count up all the tokens you have. The theory is that the more frequent a word or token is, …

    Read More »
  • 7 September

    Python Unsupervised Learning -3

    Hi, we continue where we left off on unsupervised learning. I recommend that you read our previous article before moving on to this one: Python Unsupervised Learning -2. Transforming Features For Better Clustering: Let’s look now at another dataset, the Piedmont wines dataset. 178 samples from 3 distinct varieties …

    Read More »
  • 6 September

    Python Unsupervised Learning -2

    Hi, in this article we continue where we left off from the previous topic. If you haven’t read the previous article, you can find it here: Python Unsupervised Learning -1. Evaluating a Clustering: In the previous article we used k-means to cluster the sample dataset into three clusters. …

    Read More »
  • 6 September

    Python Unsupervised Learning -1

    In this series of articles, I will explain the topic of unsupervised learning and work through examples of it. Unsupervised learning is a class of machine learning techniques for discovering patterns in data. For instance, finding the natural “clusters” of customers based on their purchase histories, or searching for patterns and …

    Read More »

August, 2020

  • 12 August

    Data Warehouse Architectures

    Data Warehouse Architectures: I would like to talk about the two most important models of data warehouse architecture. These are the Bill Inmon and Kimball models. I will not recommend which one to use in this article; we will compare the 2 models. Inmon’s Enterprise Data Warehouse: These systems feed …

    Read More »

December, 2019

  • 24 December

    Microsoft Azure Open Source Big Data & Analytic Service – HDInsight

    Hi everyone, in this article I want to talk about a very useful service of Microsoft Azure. I recommend that you check out the previous article before proceeding with this one: Apache Kafka Producer Example With Java. HDInsight: HDInsight provides an environment where you can use applications such as …

    Read More »

November, 2019

  • 27 November

    Apache Kafka Producer Example With Java

    Hi everyone, in this tutorial we will build a Kafka producer example with Java. There are multiple language options for writing a Kafka producer: you can use Java, Scala, or Python. Kafka and Zookeeper must be installed on your computer before proceeding with writing the code. You can download a …

    Read More »
  • 2 November

    Import table Mysql to HDFS Using Apache Sqoop

    Hi everyone, in this article I will transfer a table from MySQL to the HDFS file system using Sqoop. Sqoop provides the ability to transfer data from an RDBMS to HDFS. Now let’s continue with the example. Let’s download a sample set from Kaggle. Move the csv file to the virtual …

    Read More »
  • 2 November

    Big Data – Import .csv to Hive

    Hi everyone, in this article we will see how to add a dataset we downloaded from Kaggle as a Hive table. Hive is not a database; it lets you use SQL capabilities by defining metadata over the files in HDFS. Long story short, it brings the possibility …

    Read More »