{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## এক্সপ্লোরেটরি ডেটা অ্যানালাইসিস \n",
"রিভিশন ৪"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"আসলে আমাদের ডেটার ভেতরে কী আছে সেটা না জানলে এর থেকে প্রেডিকশন বের করবো কী করে? সেকারণে এই এক্সপ্লোরেশন। ডেটা নিয়ে একটু ঘাঁটাঘাঁটি করলে এর ভেতরের অনেক ধারণা পাওয়া যায় যেটা মডেল সিলেকশন অথবা ফীচারগুলো বুঝতে সুবিধা হয়। আগের চ্যাপ্টারের ভেতরে কিছুটা \"এক্সপ্লোরেটরি ডেটা অ্যানালাইসিস\" করলেও এখানে সেটাকে আরেকটু খোলাসা করছি। "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ডাটার শেপ, মানে কতোটা ইনস্ট্যান্স?"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"n_samples, n_features = iris.data.shape"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"150"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n_samples"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n_features"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of data: (150, 4)\n"
]
}
],
"source": [
"print(\"Shape of data:\", iris['data'].shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"কোন ডাটা মিসিং নেই "
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(iris.target) == n_samples"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ফিচারগুলোর নাম \n",
"\n",
"ওপরের ছবিতে চারটা ফিচারের নাম দেখেছি। চলুন দেখি সেগুলো আমাদের ডাটাসেট অবজেক্টে। iris এর পর ডট নোটেশন ব্যবহার করে ডাকি একটা \"কী\" ভ্যালুকে। feature_names হচ্ছে আমাদের iris.keys() থেকে পাওয়া একটা অ্যাট্রিবিউট।"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['sepal length (cm)',\n",
" 'sepal width (cm)',\n",
" 'petal length (cm)',\n",
" 'petal width (cm)']"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.feature_names"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n"
]
}
],
"source": [
"print(iris['feature_names'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### টার্গেট অর্থাৎ কী প্রেডিক্ট করতে চাই আমরা?\n",
"\n",
"অনেকভাবেই করা সম্ভব। তবে print ফরম্যাটিং এ ভালো কাজ করে। "
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['setosa', 'versicolor', 'virginica'],\n",
" dtype='\n",
"\n"
]
}
],
"source": [
"print(type(iris.data))\n",
"print(type(iris.target))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ফিচারের ম্যাট্রিক্স কি? (১ম ডাইমেনশন = অবজার্ভেশনের সংখ্যা, ২য় = ফিচারের সংখ্যা)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(150, 4)\n"
]
}
],
"source": [
"print(iris.data.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"টার্গেট ম্যাট্রিক্স কি? (১ম ডাইমেনশন = লেবেল, টার্গেট, রেসপন্স)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(150,)\n"
]
}
],
"source": [
"print(iris.target.shape)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of target: (150,)\n"
]
}
],
"source": [
"print(\"Shape of target:\", iris['target'].shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### সাইকিট-লার্ন এ ডাটা হ্যান্ডলিং এর নিয়ম \n",
"\n",
"১. এখানে \"ফিচার\" এবং \"রেসপন্স\" দুটো আলাদা অবজেক্ট \n",
"(আমাদের এখানে দেখুন, \"ফিচার\" এবং \"রেসপন্স\" মানে \"টার্গেট\" আলাদা অবজেক্ট)\n",
"\n",
"২. \"ফিচার\" এবং \"রেসপন্স\" দুটোকেই সংখ্যা হতে হবে \n",
"(আমাদের এখানে দুটোই সংখ্যার, দুটোর ম্যাট্রিক্স ডাইমেনশন হচ্ছে (১৫০ x ৪) এবং (১৫০ x ১)\n",
"\n",
"৩. \"ফিচার\" এবং \"রেসপন্স\" দুটোকেই \"নামপাই অ্যারে\" হতে হবে। \n",
"(আমাদের দুটো ফিচারই আছে \"নামপাই অ্যারে\"তে, বাকি ডাটা ডাটাসেট দরকার হলে সেটাকেও লোড করে নিতে হবে \"নামপাই অ্যারে\"তে)\n",
"\n",
"৪. \"ফিচার\" এবং \"রেসপন্স\" দুটোকেই স্পেসিফিক shape হতে হবে \n",
"\n",
"* ১৫০ x ৪ -> পুরো ডাটাসেট \n",
"* ১৫০ x ১ টার্গেটের জন্য \n",
"* ৪ x ১ ফিচারের জন্য \n",
"* আমরা ইচ্ছা করলে যেকোন ম্যাট্রিক্স পাল্টে নিতে পারি আমাদের দরকার মতো। যেমন np.tile(a, [4, 1]), মানে a হচ্ছে ম্যাট্রিক্স আর [4, 1] হচ্ছে ইনডেন্ট ম্যাট্রিক্স আরেক ডাইমেনশনে। "
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# ফিচার ম্যাট্রিক্স স্টোর করছি বড় \"X\"এ, মনে আছে f(x)=y কথা? x ইনপুট হলে y আউটপুট \n",
"X = iris.data\n",
"\n",
"# রেসপন্স ভেক্টর রাখছি \"y\" তে \n",
"y = iris.target"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 5.1, 3.5, 1.4, 0.2],\n",
" [ 4.9, 3. , 1.4, 0.2],\n",
" [ 4.7, 3.2, 1.3, 0.2],\n",
" [ 4.6, 3.1, 1.5, 0.2],\n",
" [ 5. , 3.6, 1.4, 0.2],\n",
" [ 5.4, 3.9, 1.7, 0.4],\n",
" [ 4.6, 3.4, 1.4, 0.3],\n",
" [ 5. , 3.4, 1.5, 0.2],\n",
" [ 4.4, 2.9, 1.4, 0.2],\n",
" [ 4.9, 3.1, 1.5, 0.1],\n",
" [ 5.4, 3.7, 1.5, 0.2],\n",
" [ 4.8, 3.4, 1.6, 0.2],\n",
" [ 4.8, 3. , 1.4, 0.1],\n",
" [ 4.3, 3. , 1.1, 0.1],\n",
" [ 5.8, 4. , 1.2, 0.2],\n",
" [ 5.7, 4.4, 1.5, 0.4],\n",
" [ 5.4, 3.9, 1.3, 0.4],\n",
" [ 5.1, 3.5, 1.4, 0.3],\n",
" [ 5.7, 3.8, 1.7, 0.3],\n",
" [ 5.1, 3.8, 1.5, 0.3],\n",
" [ 5.4, 3.4, 1.7, 0.2],\n",
" [ 5.1, 3.7, 1.5, 0.4],\n",
" [ 4.6, 3.6, 1. , 0.2],\n",
" [ 5.1, 3.3, 1.7, 0.5],\n",
" [ 4.8, 3.4, 1.9, 0.2],\n",
" [ 5. , 3. , 1.6, 0.2],\n",
" [ 5. , 3.4, 1.6, 0.4],\n",
" [ 5.2, 3.5, 1.5, 0.2],\n",
" [ 5.2, 3.4, 1.4, 0.2],\n",
" [ 4.7, 3.2, 1.6, 0.2],\n",
" [ 4.8, 3.1, 1.6, 0.2],\n",
" [ 5.4, 3.4, 1.5, 0.4],\n",
" [ 5.2, 4.1, 1.5, 0.1],\n",
" [ 5.5, 4.2, 1.4, 0.2],\n",
" [ 4.9, 3.1, 1.5, 0.1],\n",
" [ 5. , 3.2, 1.2, 0.2],\n",
" [ 5.5, 3.5, 1.3, 0.2],\n",
" [ 4.9, 3.1, 1.5, 0.1],\n",
" [ 4.4, 3. , 1.3, 0.2],\n",
" [ 5.1, 3.4, 1.5, 0.2],\n",
" [ 5. , 3.5, 1.3, 0.3],\n",
" [ 4.5, 2.3, 1.3, 0.3],\n",
" [ 4.4, 3.2, 1.3, 0.2],\n",
" [ 5. , 3.5, 1.6, 0.6],\n",
" [ 5.1, 3.8, 1.9, 0.4],\n",
" [ 4.8, 3. , 1.4, 0.3],\n",
" [ 5.1, 3.8, 1.6, 0.2],\n",
" [ 4.6, 3.2, 1.4, 0.2],\n",
" [ 5.3, 3.7, 1.5, 0.2],\n",
" [ 5. , 3.3, 1.4, 0.2],\n",
" [ 7. , 3.2, 4.7, 1.4],\n",
" [ 6.4, 3.2, 4.5, 1.5],\n",
" [ 6.9, 3.1, 4.9, 1.5],\n",
" [ 5.5, 2.3, 4. , 1.3],\n",
" [ 6.5, 2.8, 4.6, 1.5],\n",
" [ 5.7, 2.8, 4.5, 1.3],\n",
" [ 6.3, 3.3, 4.7, 1.6],\n",
" [ 4.9, 2.4, 3.3, 1. ],\n",
" [ 6.6, 2.9, 4.6, 1.3],\n",
" [ 5.2, 2.7, 3.9, 1.4],\n",
" [ 5. , 2. , 3.5, 1. ],\n",
" [ 5.9, 3. , 4.2, 1.5],\n",
" [ 6. , 2.2, 4. , 1. ],\n",
" [ 6.1, 2.9, 4.7, 1.4],\n",
" [ 5.6, 2.9, 3.6, 1.3],\n",
" [ 6.7, 3.1, 4.4, 1.4],\n",
" [ 5.6, 3. , 4.5, 1.5],\n",
" [ 5.8, 2.7, 4.1, 1. ],\n",
" [ 6.2, 2.2, 4.5, 1.5],\n",
" [ 5.6, 2.5, 3.9, 1.1],\n",
" [ 5.9, 3.2, 4.8, 1.8],\n",
" [ 6.1, 2.8, 4. , 1.3],\n",
" [ 6.3, 2.5, 4.9, 1.5],\n",
" [ 6.1, 2.8, 4.7, 1.2],\n",
" [ 6.4, 2.9, 4.3, 1.3],\n",
" [ 6.6, 3. , 4.4, 1.4],\n",
" [ 6.8, 2.8, 4.8, 1.4],\n",
" [ 6.7, 3. , 5. , 1.7],\n",
" [ 6. , 2.9, 4.5, 1.5],\n",
" [ 5.7, 2.6, 3.5, 1. ],\n",
" [ 5.5, 2.4, 3.8, 1.1],\n",
" [ 5.5, 2.4, 3.7, 1. ],\n",
" [ 5.8, 2.7, 3.9, 1.2],\n",
" [ 6. , 2.7, 5.1, 1.6],\n",
" [ 5.4, 3. , 4.5, 1.5],\n",
" [ 6. , 3.4, 4.5, 1.6],\n",
" [ 6.7, 3.1, 4.7, 1.5],\n",
" [ 6.3, 2.3, 4.4, 1.3],\n",
" [ 5.6, 3. , 4.1, 1.3],\n",
" [ 5.5, 2.5, 4. , 1.3],\n",
" [ 5.5, 2.6, 4.4, 1.2],\n",
" [ 6.1, 3. , 4.6, 1.4],\n",
" [ 5.8, 2.6, 4. , 1.2],\n",
" [ 5. , 2.3, 3.3, 1. ],\n",
" [ 5.6, 2.7, 4.2, 1.3],\n",
" [ 5.7, 3. , 4.2, 1.2],\n",
" [ 5.7, 2.9, 4.2, 1.3],\n",
" [ 6.2, 2.9, 4.3, 1.3],\n",
" [ 5.1, 2.5, 3. , 1.1],\n",
" [ 5.7, 2.8, 4.1, 1.3],\n",
" [ 6.3, 3.3, 6. , 2.5],\n",
" [ 5.8, 2.7, 5.1, 1.9],\n",
" [ 7.1, 3. , 5.9, 2.1],\n",
" [ 6.3, 2.9, 5.6, 1.8],\n",
" [ 6.5, 3. , 5.8, 2.2],\n",
" [ 7.6, 3. , 6.6, 2.1],\n",
" [ 4.9, 2.5, 4.5, 1.7],\n",
" [ 7.3, 2.9, 6.3, 1.8],\n",
" [ 6.7, 2.5, 5.8, 1.8],\n",
" [ 7.2, 3.6, 6.1, 2.5],\n",
" [ 6.5, 3.2, 5.1, 2. ],\n",
" [ 6.4, 2.7, 5.3, 1.9],\n",
" [ 6.8, 3. , 5.5, 2.1],\n",
" [ 5.7, 2.5, 5. , 2. ],\n",
" [ 5.8, 2.8, 5.1, 2.4],\n",
" [ 6.4, 3.2, 5.3, 2.3],\n",
" [ 6.5, 3. , 5.5, 1.8],\n",
" [ 7.7, 3.8, 6.7, 2.2],\n",
" [ 7.7, 2.6, 6.9, 2.3],\n",
" [ 6. , 2.2, 5. , 1.5],\n",
" [ 6.9, 3.2, 5.7, 2.3],\n",
" [ 5.6, 2.8, 4.9, 2. ],\n",
" [ 7.7, 2.8, 6.7, 2. ],\n",
" [ 6.3, 2.7, 4.9, 1.8],\n",
" [ 6.7, 3.3, 5.7, 2.1],\n",
" [ 7.2, 3.2, 6. , 1.8],\n",
" [ 6.2, 2.8, 4.8, 1.8],\n",
" [ 6.1, 3. , 4.9, 1.8],\n",
" [ 6.4, 2.8, 5.6, 2.1],\n",
" [ 7.2, 3. , 5.8, 1.6],\n",
" [ 7.4, 2.8, 6.1, 1.9],\n",
" [ 7.9, 3.8, 6.4, 2. ],\n",
" [ 6.4, 2.8, 5.6, 2.2],\n",
" [ 6.3, 2.8, 5.1, 1.5],\n",
" [ 6.1, 2.6, 5.6, 1.4],\n",
" [ 7.7, 3. , 6.1, 2.3],\n",
" [ 6.3, 3.4, 5.6, 2.4],\n",
" [ 6.4, 3.1, 5.5, 1.8],\n",
" [ 6. , 3. , 4.8, 1.8],\n",
" [ 6.9, 3.1, 5.4, 2.1],\n",
" [ 6.7, 3.1, 5.6, 2.4],\n",
" [ 6.9, 3.1, 5.1, 2.3],\n",
" [ 5.8, 2.7, 5.1, 1.9],\n",
" [ 6.8, 3.2, 5.9, 2.3],\n",
" [ 6.7, 3.3, 5.7, 2.5],\n",
" [ 6.7, 3. , 5.2, 2.3],\n",
" [ 6.3, 2.5, 5. , 1.9],\n",
" [ 6.5, 3. , 5.2, 2. ],\n",
" [ 6.2, 3.4, 5.4, 2.3],\n",
" [ 5.9, 3. , 5.1, 1.8]])"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
" 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
" 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}