این صفحه به‌وسیله ‏Cloud Translation API‏ ترجمه شده است.

واژه نامه یادگیری ماشینی: مبانی ML
با مجموعه‌ها، منظم بمانید ذخیره و طبقه‌بندی محتوا براساس اولویت‌های شما.

این صفحه شامل اصطلاحات واژه نامه اصول ML است. برای همه اصطلاحات واژه نامه، اینجا را کلیک کنید .

الف

دقت

#مبانی

#متریک

تعداد پیش‌بینی‌های طبقه‌بندی صحیح تقسیم بر تعداد کل پیش‌بینی‌ها. یعنی:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions }}$$

به عنوان مثال، مدلی که 40 پیش‌بینی درست و 10 پیش‌بینی نادرست داشته باشد، دقتی برابر با:

$$\text{Accuracy} = \frac{\text{40}} {\text{40 + 10}} = \text{80%}$$

طبقه بندی باینری نام های خاصی را برای دسته های مختلف پیش بینی های صحیح و پیش بینی های نادرست ارائه می دهد. بنابراین، فرمول دقت برای طبقه بندی باینری به شرح زیر است:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

کجا:

TP تعداد مثبت های واقعی (پیش بینی های صحیح) است.
TN تعداد منفی های واقعی (پیش بینی های صحیح) است.
FP تعداد مثبت کاذب (پیش‌بینی‌های نادرست) است.
FN تعداد منفی های کاذب (پیش بینی های نادرست) است.

مقایسه و مقایسه دقت با دقت و یادآوری .

برای جزئیات در مورد دقت و مجموعه داده های نامتعادل کلاس، روی نماد کلیک کنید.

اگرچه برای برخی موقعیت‌ها یک معیار ارزشمند است، اما دقت برای برخی دیگر بسیار گمراه‌کننده است. قابل ذکر است که دقت معمولاً معیار ضعیفی برای ارزیابی مدل‌های طبقه‌بندی است که مجموعه داده‌های نامتعادل کلاس را پردازش می‌کنند.

برای مثال، فرض کنید در یک شهر نیمه گرمسیری خاص، تنها 25 روز در قرن برف می بارد. از آنجایی که روزهای بدون برف (طبقه منفی) بسیار بیشتر از روزهای با برف (طبقه مثبت) است، مجموعه داده های برف برای این شهر از نظر طبقه نامتعادل است. یک مدل طبقه‌بندی باینری را تصور کنید که قرار است هر روز برف یا بدون برف را پیش‌بینی کند، اما به سادگی هر روز «بدون برف» را پیش‌بینی می‌کند. این مدل بسیار دقیق است اما قدرت پیش بینی ندارد. جدول زیر نتایج یک قرن پیش‌بینی را خلاصه می‌کند:

دسته بندی	شماره
TP	0
TN	36499
FP	0
FN	25

بنابراین دقت این مدل عبارت است از:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (0 + 36499) / (0 + 36499 + 0 + 25) = 0.9993 = 99.93%

اگرچه دقت 99.93 درصد بسیار چشمگیر به نظر می رسد، این مدل در واقع قدرت پیش بینی ندارد.

دقت و یادآوری معمولاً معیارهای مفیدتری نسبت به دقت برای ارزیابی مدل‌های آموزش دیده بر روی مجموعه داده‌های نامتعادل کلاس هستند.

برای اطلاعات بیشتر به طبقه بندی: دقت، یادآوری، دقت و معیارهای مرتبط در دوره تصادف یادگیری ماشین مراجعه کنید.

عملکرد فعال سازی

#مبانی

تابعی که شبکه های عصبی را قادر می سازد تا روابط غیرخطی (پیچیده) بین ویژگی ها و برچسب را بیاموزند.

توابع فعال سازی محبوب عبارتند از:

ReLU
سیگموئید

نمودار توابع فعال سازی هرگز خطوط مستقیم منفرد نیستند. به عنوان مثال، نمودار تابع فعال سازی ReLU از دو خط مستقیم تشکیل شده است:

طرح دکارتی از دو خط. خط اول یک ثابت دارد مقدار y از 0، در امتداد محور x از -infinity، 0 تا 0،-0 اجرا می شود. خط دوم از 0.0 شروع می شود. این خط دارای شیب +1 است، بنابراین از 0،0 تا + بی نهایت، + بی نهایت اجرا می شود.

نمودار تابع فعال سازی سیگموئید به صورت زیر است:

یک نمودار منحنی دو بعدی با مقادیر x در دامنه -infinity تا + مثبت، در حالی که مقادیر y محدوده تقریباً 0 تا را در بر می گیرند تقریباً 1. وقتی x 0 است، y 0.5 است. شیب منحنی همیشه است مثبت، با بیشترین شیب 0.0.5 و به تدریج کاهش می یابد با افزایش قدر مطلق x شیب می شود.

برای مشاهده نمونه روی نماد کلیک کنید.

در یک شبکه عصبی، توابع فعال سازی مجموع وزنی تمام ورودی های یک نورون را دستکاری می کنند. برای محاسبه یک جمع وزنی، نورون حاصل جمع مقادیر و وزن های مربوطه را جمع می کند. برای مثال، فرض کنید ورودی مربوط به یک نورون شامل موارد زیر است:

مقدار ورودی	وزن ورودی
2	-1.3
-1	0.6
3	0.4

بنابراین مجموع وزنی عبارت است از:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

فرض کنید طراح این شبکه عصبی تابع سیگموئید را به عنوان تابع فعال سازی انتخاب کرده است. در آن صورت، نورون سیگموئید -2.0 را محاسبه می کند که تقریباً 0.12 است. بنابراین، نورون 0.12 (به جای -2.0) را به لایه بعدی در شبکه عصبی منتقل می کند. شکل زیر بخش مربوط به فرآیند را نشان می دهد:

برای اطلاعات بیشتر ، شبکه‌های عصبی: توابع فعال‌سازی را در دوره آموزشی تصادفی یادگیری ماشین ببینید.

هوش مصنوعی

#مبانی

یک برنامه یا مدل غیر انسانی که می تواند کارهای پیچیده را حل کند. برای مثال، برنامه یا مدلی که متن را ترجمه می کند یا برنامه یا مدلی که بیماری ها را از تصاویر رادیولوژیک شناسایی می کند، هر دو هوش مصنوعی را نشان می دهند.

به طور رسمی، یادگیری ماشین زیر شاخه هوش مصنوعی است. با این حال، در سال‌های اخیر، برخی از سازمان‌ها شروع به استفاده از اصطلاحات هوش مصنوعی و یادگیری ماشینی کرده‌اند.

AUC (مساحت زیر منحنی ROC)

#مبانی

#متریک

عددی بین 0.0 و 1.0 نشان دهنده توانایی یک مدل طبقه بندی باینری برای جداسازی کلاس های مثبت از کلاس های منفی است. هر چه AUC به 1.0 نزدیکتر باشد، مدل توانایی بهتری برای جداسازی کلاس ها از یکدیگر دارد.

برای مثال، تصویر زیر یک مدل طبقه‌بندی را نشان می‌دهد که کلاس‌های مثبت (بیضی‌های سبز) را از کلاس‌های منفی (مستطیل‌های بنفش) کاملاً جدا می‌کند. این مدل غیرواقعی کامل دارای AUC 1.0 است:

یک خط اعداد با 8 مثال مثبت در یک طرف و 9 مثال منفی در طرف دیگر.

برعکس، تصویر زیر نتایج یک مدل طبقه‌بندی را نشان می‌دهد که نتایج تصادفی ایجاد می‌کند. این مدل دارای AUC 0.5 است:

یک خط اعداد با 6 مثال مثبت و 6 مثال منفی. دنباله مثال ها مثبت، منفی است، مثبت، منفی، مثبت، منفی، مثبت، منفی، مثبت منفی، مثبت، منفی

بله، مدل قبلی دارای AUC 0.5 است، نه 0.0.

اکثر مدل ها جایی بین دو حالت افراطی هستند. به عنوان مثال، مدل زیر موارد مثبت را تا حدودی از منفی جدا می کند و بنابراین دارای AUC بین 0.5 و 1.0 است:

یک خط اعداد با 6 مثال مثبت و 6 مثال منفی. دنباله مثال ها منفی، منفی، منفی، منفی، مثبت، منفی، مثبت، مثبت، منفی، مثبت، مثبت، مثبت

AUC هر مقداری را که برای آستانه طبقه بندی تنظیم کرده اید نادیده می گیرد. در عوض، AUC تمام آستانه های طبقه بندی ممکن را در نظر می گیرد.

برای اطلاع از رابطه بین منحنی های AUC و ROC روی نماد کلیک کنید.

AUC نشان دهنده سطح زیر منحنی ROC است. به عنوان مثال، منحنی ROC برای مدلی که به طور کامل نکات مثبت را از منفی جدا می کند، به صورت زیر است:

AUC ناحیه خاکستری در تصویر قبل است. در این حالت غیر معمول، مساحت به سادگی طول ناحیه خاکستری (1.0) ضرب در عرض ناحیه خاکستری (1.0) است. بنابراین، حاصل ضرب 1.0 و 1.0 AUC دقیقاً 1.0 را به دست می دهد که بالاترین امتیاز AUC ممکن است.

برعکس، منحنی ROC برای یک مدل طبقه بندی که به هیچ وجه نمی تواند کلاس ها را از هم جدا کند، به شرح زیر است. مساحت این منطقه خاکستری 0.5 است.

یک منحنی معمولی ROC تقریباً شبیه زیر است:

محاسبه مساحت زیر این منحنی به صورت دستی دشوار خواهد بود، به همین دلیل است که یک برنامه معمولاً بیشتر مقادیر AUC را محاسبه می کند.

برای تعریف رسمی تر AUC روی نماد کلیک کنید.

AUC احتمال این است که یک مدل طبقه بندی مطمئن تر از مثبت بودن یک مثال تصادفی مثبت باشد تا اینکه یک مثال منفی تصادفی انتخاب شده مثبت باشد.

برای اطلاعات بیشتر به طبقه بندی: ROC و AUC در دوره تصادف یادگیری ماشینی مراجعه کنید.

ب

پس انتشار

#مبانی

الگوریتمی که نزول گرادیان را در شبکه های عصبی پیاده سازی می کند.

آموزش یک شبکه عصبی شامل تکرارهای زیادی از چرخه دو پاس زیر است:

در طول پاس رو به جلو ، سیستم دسته‌ای از نمونه‌ها را پردازش می‌کند تا پیش‌بینی (های) را به دست آورد. سیستم هر پیش بینی را با هر برچسب مقایسه می کند. تفاوت بین مقدار پیش‌بینی و برچسب، ضرر آن مثال است. سیستم تلفات را برای همه نمونه‌ها جمع‌آوری می‌کند تا مجموع ضرر را برای دسته فعلی محاسبه کند.
در طول گذر به عقب (انتشار عقب)، سیستم با تنظیم وزن تمام نورون ها در تمام لایه(های) پنهان، تلفات را کاهش می دهد.

شبکه‌های عصبی اغلب حاوی نورون‌های زیادی در لایه‌های پنهان بسیاری هستند. هر یک از این نورون ها به روش های مختلفی در از دست دادن کلی نقش دارند. انتشار معکوس تعیین می کند که آیا وزن اعمال شده روی نورون های خاص افزایش یا کاهش یابد.

نرخ یادگیری یک ضریب است که میزان افزایش یا کاهش هر وزنه توسط هر پاس به عقب را کنترل می کند. نرخ یادگیری زیاد هر وزن را بیش از یک نرخ یادگیری کوچک افزایش یا کاهش می دهد.

از نظر حساب دیفرانسیل و انتگرال، پس انتشار قانون زنجیره را اجرا می کند. از حساب دیفرانسیل و انتگرال یعنی پس انتشار مشتق جزئی خطا را با توجه به هر پارامتر محاسبه می کند.

سال‌ها پیش، تمرین‌کنندگان ML مجبور بودند کدی را برای پیاده‌سازی انتشار پس‌انداز بنویسند. API های مدرن ML مانند Keras اکنون پس انتشار را برای شما پیاده سازی می کنند. اوه!

برای اطلاعات بیشتر ، شبکه های عصبی را در دوره آموزشی تصادفی یادگیری ماشین ببینید.

دسته ای

#مبانی

مجموعه مثال های مورد استفاده در یک تکرار آموزشی. اندازه دسته تعداد نمونه ها را در یک دسته تعیین می کند.

برای توضیح نحوه ارتباط یک دسته با یک دوره، به epoch مراجعه کنید.

برای اطلاعات بیشتر به رگرسیون خطی: Hyperparameters in Machine Learning Crash Course مراجعه کنید.

اندازه دسته

#مبانی

تعداد نمونه ها در یک دسته . به عنوان مثال، اگر اندازه دسته 100 باشد، مدل در هر تکرار 100 نمونه را پردازش می کند.

استراتژی های اندازه دسته ای محبوب زیر هستند:

نزول گرادیان تصادفی (SGD) که در آن اندازه دسته 1 است.
دسته کامل، که در آن اندازه دسته، تعداد نمونه‌های کل مجموعه آموزشی است. به عنوان مثال، اگر مجموعه آموزشی حاوی یک میلیون مثال باشد، اندازه دسته ای یک میلیون نمونه خواهد بود. دسته کامل معمولا یک استراتژی ناکارآمد است.
مینی بچ که در آن اندازه دسته معمولا بین 10 تا 1000 است. مینی بچ معمولا کارآمدترین استراتژی است.

برای اطلاعات بیشتر به ادامه مطلب مراجعه کنید:

سیستم‌های ML تولید: استنتاج استاتیک در مقابل پویا در دوره تصادف یادگیری ماشین.
کتاب راهنما تنظیم یادگیری عمیق .

تعصب (اخلاق / انصاف)

#مسئول

#مبانی

1. کلیشه سازی، تعصب یا طرفداری نسبت به برخی چیزها، افراد یا گروه ها نسبت به دیگران. این سوگیری ها می توانند بر جمع آوری و تفسیر داده ها، طراحی یک سیستم و نحوه تعامل کاربران با یک سیستم تأثیر بگذارند. اشکال این نوع سوگیری عبارتند از:

2. خطای سیستماتیک معرفی شده توسط یک روش نمونه گیری یا گزارش. اشکال این نوع سوگیری عبارتند از:

نباید با اصطلاح سوگیری در مدل‌های یادگیری ماشین یا سوگیری پیش‌بینی اشتباه گرفته شود.

برای اطلاعات بیشتر به Fairness: Types of Bias in Machine Learning Crash Course مراجعه کنید.

تعصب (ریاضی) یا اصطلاح سوگیری

#مبانی

رهگیری یا جبران از مبدأ. تعصب یک پارامتر در مدل های یادگیری ماشینی است که با یکی از موارد زیر نشان داده می شود:

ب
w ₀

به عنوان مثال، بایاس b در فرمول زیر است:

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

در یک خط دوبعدی ساده، بایاس فقط به معنای «قطعه y» است. به عنوان مثال، بایاس خط در تصویر زیر 2 است.

نمودار یک خط با شیب 0.5 و بایاس (برق y) 2.

تعصب وجود دارد زیرا همه مدل ها از مبدا (0,0) شروع نمی شوند. به عنوان مثال، فرض کنید یک پارک تفریحی برای ورود به آن 2 یورو و برای هر ساعت اقامت مشتری 0.5 یورو اضافی هزینه دارد. بنابراین، مدلی که هزینه کل را نگاشت می کند، بایاس 2 دارد زیرا کمترین هزینه 2 یورو است.

سوگیری نباید با سوگیری در اخلاق و انصاف یا سوگیری پیش بینی اشتباه شود.

برای اطلاعات بیشتر به رگرسیون خطی در دوره تصادف یادگیری ماشین مراجعه کنید.

طبقه بندی باینری

#مبانی

یک نوع کار طبقه بندی که یکی از دو کلاس منحصر به فرد را پیش بینی می کند:

طبقه مثبت
طبقه منفی

به عنوان مثال، دو مدل یادگیری ماشین زیر، هر کدام دسته بندی باینری را انجام می دهند:

مدلی که تعیین می‌کند پیام‌های ایمیل هرزنامه هستند (کلاس مثبت) یا اسپم نیستند (کلاس منفی).
مدلی که علائم پزشکی را ارزیابی می کند تا مشخص کند آیا یک فرد دارای یک بیماری خاص (طبقه مثبت) است یا آن بیماری (طبقه منفی) را ندارد.

در تقابل با طبقه بندی چند طبقه .

همچنین به رگرسیون لجستیک و آستانه طبقه بندی مراجعه کنید.

برای اطلاعات بیشتر به طبقه بندی در دوره تصادف یادگیری ماشین مراجعه کنید.

سطل سازی

#مبانی

تبدیل یک ویژگی واحد به چندین ویژگی باینری به نام سطل یا bins ، که معمولاً بر اساس یک محدوده مقدار است. ویژگی خرد شده معمولاً یک ویژگی پیوسته است.

به عنوان مثال، به جای نمایش دما به عنوان یک ویژگی ممیز شناور منفرد، می توانید محدوده دما را به سطل های مجزا تقسیم کنید، مانند:

<= 10 درجه سانتیگراد سطل "سرد" خواهد بود.
11 تا 24 درجه سانتیگراد سطل "معتدل" خواهد بود.
>= 25 درجه سانتیگراد سطل "گرم" خواهد بود.

مدل با هر مقدار در یک سطل یکسان رفتار می کند. به عنوان مثال، مقادیر 13 و 22 هر دو در سطل معتدل هستند، بنابراین مدل با دو مقدار یکسان رفتار می کند.

برای یادداشت های اضافی روی نماد کلیک کنید.

اگر دما را به عنوان یک ویژگی پیوسته نشان دهید، مدل دما را به عنوان یک ویژگی واحد در نظر می گیرد. اگر دما را به عنوان سه سطل نشان دهید، مدل هر سطل را به عنوان یک ویژگی جداگانه در نظر می گیرد. یعنی یک مدل می تواند روابط جداگانه هر سطل را با برچسب یاد بگیرد. به عنوان مثال، یک مدل رگرسیون خطی می تواند وزن های جداگانه ای را برای هر سطل یاد بگیرد.

افزایش تعداد سطل ها با افزایش تعداد روابطی که مدل شما باید یاد بگیرد، مدل شما را پیچیده تر می کند. به عنوان مثال، سطل های سرد، معتدل و گرم اساساً سه ویژگی مجزا برای مدل شما هستند که می توانند با آن تمرین کنند. اگر تصمیم دارید دو سطل دیگر اضافه کنید - به عنوان مثال، انجماد و داغ - مدل شما اکنون باید روی پنج ویژگی جداگانه آموزش ببیند.

چگونه می دانید که چند سطل ایجاد کنید یا محدوده هر سطل چقدر باید باشد؟ پاسخ‌ها معمولاً به مقدار کافی آزمایش نیاز دارند.

برای اطلاعات بیشتر به داده‌های عددی: Binning in Machine Learning Crash Course مراجعه کنید.

سی

داده های طبقه بندی شده

#مبانی

ویژگی هایی که مجموعه خاصی از مقادیر ممکن را دارند. به عنوان مثال، یک ویژگی طبقه بندی به نام traffic-light-state را در نظر بگیرید که فقط می تواند یکی از سه مقدار ممکن زیر را داشته باشد:

red
yellow
green

با نشان دادن traffic-light-state به عنوان یک ویژگی طبقه‌بندی، یک مدل می‌تواند تأثیرات متفاوت red ، green و yellow بر رفتار راننده بیاموزد.

ویژگی‌های طبقه‌بندی گاهی اوقات ویژگی‌های گسسته نامیده می‌شوند.

در مقابل داده های عددی .

برای اطلاعات بیشتر، کار با داده های طبقه بندی شده را در دوره تصادف یادگیری ماشینی ببینید.

کلاس

#مبانی

دسته ای که یک برچسب می تواند به آن تعلق داشته باشد. به عنوان مثال:

در یک مدل طبقه‌بندی باینری که هرزنامه را شناسایی می‌کند، این دو کلاس ممکن است هرزنامه باشند و نه هرزنامه .
در یک مدل طبقه‌بندی چند طبقه که نژادهای سگ را مشخص می‌کند، کلاس‌ها ممکن است پودل ، بیگل ، پاگ و غیره باشند.

یک مدل طبقه بندی یک کلاس را پیش بینی می کند. در مقابل، یک مدل رگرسیون یک عدد را به جای یک کلاس پیش بینی می کند.

برای اطلاعات بیشتر به طبقه بندی در دوره تصادف یادگیری ماشین مراجعه کنید.

مدل طبقه بندی

#مبانی

مدلی که پیش‌بینی آن یک کلاس است. به عنوان مثال، موارد زیر همه مدل های طبقه بندی هستند:

مدلی که زبان جمله ورودی (فرانسوی؟ اسپانیایی؟ ایتالیایی؟) را پیش بینی می کند.
مدلی که گونه های درختی (افرا؟ بلوط؟ بائوباب؟) را پیش بینی می کند.
مدلی که کلاس مثبت یا منفی را برای یک بیماری خاص پیش بینی می کند.

در مقابل، مدل های رگرسیون اعداد را به جای کلاس ها پیش بینی می کنند.

دو نوع رایج از مدل های طبقه بندی عبارتند از:

طبقه بندی باینری
طبقه بندی چند طبقه

آستانه طبقه بندی

#مبانی

در یک طبقه بندی باینری ، عددی بین 0 و 1 که خروجی خام یک مدل رگرسیون لجستیک را به پیش بینی کلاس مثبت یا منفی تبدیل می کند. توجه داشته باشید که آستانه طبقه بندی مقداری است که یک انسان انتخاب می کند، نه ارزشی که توسط آموزش مدل انتخاب شده است.

یک مدل رگرسیون لجستیک یک مقدار خام بین 0 و 1 خروجی می دهد. سپس:

اگر این مقدار خام بیشتر از آستانه طبقه بندی باشد، کلاس مثبت پیش بینی می شود.
اگر این مقدار خام کمتر از آستانه طبقه بندی باشد، کلاس منفی پیش بینی می شود.

به عنوان مثال، فرض کنید آستانه طبقه بندی 0.8 باشد. اگر مقدار خام 0.9 باشد، مدل کلاس مثبت را پیش بینی می کند. اگر مقدار خام 0.7 باشد، مدل کلاس منفی را پیش بینی می کند.

انتخاب آستانه طبقه بندی به شدت بر تعداد مثبت کاذب و منفی کاذب تأثیر می گذارد.

برای یادداشت های اضافی روی نماد کلیک کنید.

همانطور که مدل ها یا مجموعه داده ها تکامل می یابند، مهندسان گاهی اوقات آستانه طبقه بندی را نیز تغییر می دهند. وقتی آستانه طبقه‌بندی تغییر می‌کند، پیش‌بینی‌های کلاس مثبت می‌توانند ناگهان به کلاس‌های منفی تبدیل شوند و بالعکس.

به عنوان مثال، یک مدل پیش‌بینی بیماری طبقه‌بندی باینری را در نظر بگیرید. فرض کنید وقتی سیستم در سال اول اجرا می شود:

مقدار خام برای یک بیمار خاص 0.95 است.
آستانه طبقه بندی 0.94 است.

بنابراین، سیستم طبقه مثبت را تشخیص می دهد. (بیمار نفس نفس می زند: "اوه نه! من مریض هستم!")

یک سال بعد، شاید اکنون مقادیر به شرح زیر باشد:

مقدار خام برای همان بیمار 0.95 باقی می ماند.
آستانه طبقه بندی به 0.97 تغییر می کند.

بنابراین، سیستم اکنون آن بیمار را به عنوان طبقه منفی طبقه بندی می کند. ("روزت مبارک! من مریض نیستم.") همان بیمار. تشخیص متفاوت

برای اطلاعات بیشتر ، آستانه‌ها و ماتریس سردرگمی را در دوره آموزشی تصادفی یادگیری ماشین ببینید.

طبقه بندی کننده

#مبانی

یک اصطلاح معمولی برای یک مدل طبقه بندی .

مجموعه داده های کلاس نامتعادل

#مبانی

مجموعه داده ای برای یک مسئله طبقه بندی که در آن تعداد کل برچسب های هر کلاس به طور قابل توجهی متفاوت است. به عنوان مثال، یک مجموعه داده طبقه بندی باینری را در نظر بگیرید که دو برچسب آن به صورت زیر تقسیم می شوند:

1,000,000 برچسب منفی
10 برچسب مثبت

نسبت برچسب های منفی به مثبت 100000 به 1 است، بنابراین این یک مجموعه داده با کلاس نامتعادل است.

در مقابل، مجموعه داده زیر از نظر کلاس نامتعادل نیست زیرا نسبت برچسب های منفی به برچسب های مثبت نسبتا نزدیک به 1 است:

517 برچسب منفی
483 برچسب مثبت

مجموعه داده‌های چند کلاسه نیز می‌توانند دارای عدم تعادل کلاسی باشند. به عنوان مثال، مجموعه داده طبقه‌بندی چند کلاسه زیر نیز از نظر کلاس نامتعادل است، زیرا یک برچسب نمونه‌های بسیار بیشتری نسبت به دو برچسب دیگر دارد:

1,000,000 برچسب با کلاس "سبز"
200 برچسب با کلاس "بنفش"
350 برچسب با کلاس "نارنجی"

همچنین به آنتروپی ، کلاس اکثریت و کلاس اقلیت مراجعه کنید.

بریدن

#مبانی

تکنیکی برای رسیدگی به موارد پرت با انجام یکی یا هر دو مورد زیر:

کاهش مقادیر ویژگی که بیشتر از یک آستانه حداکثر است تا آن آستانه حداکثر.
افزایش مقادیر ویژگی که کمتر از یک آستانه حداقل تا آن آستانه حداقل است.

برای مثال، فرض کنید که <0.5٪ از مقادیر یک ویژگی خاص خارج از محدوده 40-60 باشد. در این صورت می توانید کارهای زیر را انجام دهید:

تمام مقادیر بالای 60 (حداکثر آستانه) را دقیقاً 60 کنید.
تمام مقادیر زیر 40 (حداقل آستانه) را دقیقاً 40 کنید.

پرت ها می توانند به مدل ها آسیب برسانند و گاهی اوقات باعث سرریز وزنه ها در طول تمرین می شوند. برخی از نقاط پرت نیز می توانند به طور چشمگیری معیارهایی مانند دقت را خراب کنند. برش یک تکنیک رایج برای محدود کردن آسیب است.

برش گرادیان مقادیر گرادیان را در یک محدوده تعیین شده در طول تمرین مجبور می کند.

برای اطلاعات بیشتر به داده های عددی: عادی سازی در دوره تصادف یادگیری ماشین مراجعه کنید.

ماتریس سردرگمی

#مبانی

یک جدول NxN که تعداد پیش‌بینی‌های صحیح و نادرست را که یک مدل طبقه‌بندی انجام داده است، خلاصه می‌کند. به عنوان مثال، ماتریس سردرگمی زیر را برای یک مدل طبقه بندی باینری در نظر بگیرید:

	تومور (پیش بینی شده)	غیر توموری (پیش بینی شده)
تومور (حقیقت زمینی)	18 (TP)	1 (FN)
غیر تومور (حقیقت زمینی)	6 (FP)	452 (TN)

ماتریس سردرگمی قبلی موارد زیر را نشان می دهد:

از 19 پیش‌بینی که در آنها حقیقت پایه تومور بود، مدل 18 را به درستی و 1 را به اشتباه طبقه‌بندی کرد.
از 458 پیش‌بینی که در آنها حقیقت پایه غیرتوموری بود، مدل 452 را به درستی و 6 را به اشتباه طبقه‌بندی کرد.

ماتریس سردرگمی برای یک مسئله طبقه بندی چند طبقه می تواند به شما در شناسایی الگوهای اشتباه کمک کند. به عنوان مثال، ماتریس سردرگمی زیر را برای یک مدل طبقه‌بندی چند کلاسه سه کلاسه در نظر بگیرید که سه نوع عنبیه مختلف (ویرجینیکا، ورسیکالر و ستوزا) را دسته‌بندی می‌کند. زمانی که حقیقت اصلی ویرجینیکا بود، ماتریس سردرگمی نشان می‌دهد که این مدل به احتمال زیاد Versicolor را به اشتباه پیش‌بینی می‌کرد تا Setosa:

	ستوزا (پیش بینی شده)	Versicolor (پیش‌بینی شده)	ویرجینیکا (پیش بینی شده)
ستوسا (حقیقت زمینی)	88	12	0
Versicolor (حقیقت زمینی)	6	141	7
ویرجینیکا (حقیقت زمینی)	2	27	109

به عنوان مثال دیگری، یک ماتریس سردرگمی می‌تواند نشان دهد که مدلی که برای تشخیص ارقام دست‌نویس آموزش دیده است، به اشتباه 9 را به جای 4 پیش‌بینی می‌کند، یا به اشتباه 1 را به جای 7 پیش‌بینی می‌کند.

ماتریس های سردرگمی حاوی اطلاعات کافی برای محاسبه انواع معیارهای عملکرد، از جمله دقت و یادآوری هستند .

ویژگی پیوسته

#مبانی

یک ویژگی ممیز شناور با دامنه نامتناهی از مقادیر ممکن، مانند دما یا وزن.

کنتراست با ویژگی گسسته .

همگرایی

#مبانی

حالتی به دست می آید که مقادیر زیان با هر تکرار خیلی کم یا اصلاً تغییر نمی کند. به عنوان مثال، منحنی ضرر زیر همگرایی را در حدود 700 تکرار نشان می دهد:

طرح دکارتی. محور X از دست دادن است. محور Y تعداد تمرینات است تکرارها ضرر در چند تکرار اول بسیار زیاد است، اما به شدت سقوط می کند. پس از حدود 100 تکرار، ضرر همچنان باقی است نزولی اما به مراتب تدریجی تر. پس از حدود 700 بار تکرار، ضرر ثابت می ماند

یک مدل زمانی همگرا می شود که آموزش اضافی مدل را بهبود نبخشد.

در یادگیری عمیق ، مقادیر از دست دادن گاهی اوقات ثابت می ماند یا تقریباً برای بسیاری از تکرارها قبل از اینکه در نهایت کاهش یابد، ثابت می ماند. در طول یک دوره طولانی مقادیر ثابت از دست دادن، ممکن است به طور موقت احساس کاذب همگرایی داشته باشید.

توقف زودهنگام را نیز ببینید.

برای اطلاعات بیشتر ، منحنی‌های همگرایی و تلفات مدل را در دوره تصادف یادگیری ماشینی ببینید.

D

DataFrame

#مبانی

یک نوع داده محبوب پانداها برای نمایش مجموعه داده ها در حافظه.

یک DataFrame مشابه یک جدول یا یک صفحه گسترده است. هر ستون از یک DataFrame یک نام (یک سرصفحه) دارد و هر ردیف با یک عدد منحصر به فرد مشخص می شود.

هر ستون در یک DataFrame مانند یک آرایه دو بعدی ساختار یافته است، با این تفاوت که به هر ستون می توان نوع داده خاص خود را اختصاص داد.

همچنین به صفحه مرجع رسمی pandas.DataFrame مراجعه کنید.

مجموعه داده یا مجموعه داده

#مبانی

مجموعه ای از داده های خام، معمولا (اما نه منحصرا) در یکی از قالب های زیر سازماندهی شده است:

یک صفحه گسترده
یک فایل با فرمت CSV (مقادیر جدا شده با کاما).

مدل عمیق

#مبانی

یک شبکه عصبی حاوی بیش از یک لایه پنهان .

یک مدل عمیق، شبکه عصبی عمیق نیز نامیده می شود.

کنتراست با مدل عریض .

ویژگی متراکم

#مبانی

ویژگی که در آن اکثر یا همه مقادیر غیر صفر هستند، معمولاً تانسوری از مقادیر ممیز شناور است. به عنوان مثال، تانسور 10 عنصری زیر چگال است زیرا 9 مقدار آن غیر صفر است:

کنتراست با ویژگی پراکنده .

عمق

#مبانی

مجموع موارد زیر در یک شبکه عصبی :

تعداد لایه های پنهان
تعداد لایه های خروجی که معمولاً 1 است
تعداد لایه های تعبیه شده

به عنوان مثال، یک شبکه عصبی با پنج لایه پنهان و یک لایه خروجی دارای عمق 6 است.

توجه داشته باشید که لایه ورودی بر عمق تأثیر نمی گذارد.

ویژگی گسسته

#مبانی

ویژگی با مجموعه محدودی از مقادیر ممکن. برای مثال، یک ویژگی که مقادیر آن ممکن است فقط حیوانی ، گیاهی یا معدنی باشد، یک ویژگی گسسته (یا طبقه‌بندی) است.

کنتراست با ویژگی پیوسته .

پویا

#مبانی

کاری که به طور مکرر یا مداوم انجام می شود. اصطلاحات پویا و آنلاین در یادگیری ماشین مترادف هستند. موارد زیر کاربردهای رایج پویا و آنلاین در یادگیری ماشینی است:

مدل پویا (یا مدل آنلاین ) مدلی است که به طور مکرر یا پیوسته بازآموزی می شود.
آموزش پویا (یا آموزش آنلاین ) فرآیند آموزش مکرر یا مداوم است.
استنتاج پویا (یا استنتاج آنلاین ) فرآیند تولید پیش‌بینی‌ها بر حسب تقاضا است.

مدل پویا

#مبانی

مدلی که به طور مکرر (شاید حتی به طور مداوم) بازآموزی می شود. یک مدل پویا یک "یادگیرنده مادام العمر" است که دائماً با داده های در حال تکامل سازگار می شود. یک مدل پویا به عنوان مدل آنلاین نیز شناخته می شود.

کنتراست با مدل استاتیک .

E

توقف زودهنگام

#مبانی

روشی برای منظم‌سازی که شامل پایان دادن به تمرین قبل از کاهش افت تمرین است. در توقف اولیه، زمانی که از دست دادن مجموعه داده اعتبارسنجی شروع به افزایش می‌کند، عمداً آموزش مدل را متوقف می‌کنید. یعنی زمانی که عملکرد تعمیم بدتر می شود.

برای یادداشت های اضافی روی نماد کلیک کنید.

توقف زودهنگام ممکن است خلاف واقع به نظر برسد. به هر حال، گفتن به یک مدل برای توقف تمرین در حالی که ضرر هنوز در حال کاهش است ممکن است به نظر به نظر برسد که به سرآشپز بگویید قبل از پختن کامل دسر، پختن را متوقف کند. با این حال، آموزش یک مدل برای مدت طولانی می تواند منجر به بیش از حد برازش شود. یعنی اگر یک مدل را بیش از حد طولانی آموزش دهید، ممکن است مدل آنقدر با داده های آموزشی مطابقت داشته باشد که مدل پیش بینی خوبی در نمونه های جدید انجام ندهد.

لایه جاسازی

#زبان

#مبانی

یک لایه مخفی ویژه که بر روی یک ویژگی طبقه بندی با ابعاد بالا آموزش می دهد تا به تدریج بردار تعبیه ابعاد پایین تر را یاد بگیرد. یک لایه جاسازی شبکه عصبی را قادر می‌سازد تا بسیار کارآمدتر از آموزش فقط بر روی ویژگی طبقه‌بندی با ابعاد بالا آموزش ببیند.

برای مثال، زمین در حال حاضر از حدود 73000 گونه درختی پشتیبانی می کند. فرض کنید گونه درختی یک ویژگی در مدل شما باشد، بنابراین لایه ورودی مدل شما شامل یک بردار یک داغ به طول 73000 عنصر است. برای مثال، شاید baobab چیزی شبیه به این نشان داده شود:

آرایه ای از 73000 عنصر. 6232 عنصر اول مقدار را حفظ می کنند 0. عنصر بعدی مقدار 1 را دارد. 66767 عنصر نهایی باقی می مانند مقدار صفر

یک آرایه 73000 عنصری بسیار طولانی است. اگر یک لایه جاسازی به مدل اضافه نکنید، به دلیل ضرب 72999 صفر، آموزش بسیار وقت گیر خواهد بود. شاید لایه جاسازی را از 12 بعد انتخاب کنید. در نتیجه، لایه جاسازی به تدریج یک بردار تعبیه جدید برای هر گونه درختی را یاد می گیرد.

در شرایط خاص، هش جایگزین معقولی برای لایه جاسازی است.

برای اطلاعات بیشتر، به دوره آموزشی تصادفی آموزش ماشینی (Embeddings in Machine Learning) مراجعه کنید.

دوران

#مبانی

یک پاس آموزشی کامل در کل مجموعه آموزشی به طوری که هر نمونه یک بار پردازش شده است.

یک دوره نشان دهنده تکرارهای آموزشی اندازه N / دسته ای است که در آن N تعداد کل نمونه ها است.

به عنوان مثال، موارد زیر را فرض کنید:

مجموعه داده شامل 1000 نمونه است.
اندازه دسته 50 نمونه است.

بنابراین، یک دوره واحد نیاز به 20 تکرار دارد:

1 epoch = (N/batch size) = (1,000 / 50) = 20 iterations

برای اطلاعات بیشتر به رگرسیون خطی: Hyperparameters in Machine Learning Crash Course مراجعه کنید.

مثال

#مبانی

مقادیر یک ردیف از ویژگی ها و احتمالاً یک برچسب . نمونه هایی در یادگیری تحت نظارت به دو دسته کلی تقسیم می شوند:

یک مثال برچسب گذاری شده از یک یا چند ویژگی و یک برچسب تشکیل شده است. در طول آموزش از نمونه های برچسب دار استفاده می شود.
یک مثال بدون برچسب شامل یک یا چند ویژگی است اما بدون برچسب. در طول استنتاج از نمونه های بدون برچسب استفاده می شود.

به عنوان مثال، فرض کنید در حال آموزش مدلی برای تعیین تأثیر شرایط آب و هوایی بر نمرات آزمون دانش آموزان هستید. در اینجا سه نمونه برچسب گذاری شده وجود دارد:

ویژگی ها			برچسب بزنید
دما	رطوبت	فشار	نمره آزمون
15	47	998	خوب
19	34	1020	عالی
18	92	1012	بیچاره

در اینجا سه نمونه بدون برچسب آورده شده است:

دما	رطوبت	فشار
12	62	1014
21	47	1017
19	41	1021

ردیف یک مجموعه داده معمولاً منبع خام برای مثال است. یعنی یک مثال معمولاً از زیر مجموعه ای از ستون های مجموعه داده تشکیل شده است. علاوه بر این، ویژگی‌های یک مثال می‌تواند شامل ویژگی‌های مصنوعی ، مانند تلاقی ویژگی‌ها نیز باشد.

برای اطلاعات بیشتر، آموزش تحت نظارت را در دوره مقدماتی یادگیری ماشین ببینید.

اف

منفی کاذب (FN)

#مبانی

#متریک

مثالی که در آن مدل به اشتباه کلاس منفی را پیش بینی می کند. برای مثال، مدل پیش‌بینی می‌کند که یک پیام ایمیل خاص هرزنامه نیست (کلاس منفی)، اما آن پیام ایمیل در واقع هرزنامه است .

مثبت کاذب (FP)

#مبانی

#متریک

مثالی که در آن مدل به اشتباه کلاس مثبت را پیش بینی می کند. برای مثال، مدل پیش‌بینی می‌کند که یک پیام ایمیل خاص هرزنامه است (کلاس مثبت)، اما آن پیام ایمیل در واقع هرزنامه نیست .

برای اطلاعات بیشتر ، آستانه‌ها و ماتریس سردرگمی را در دوره آموزشی تصادفی یادگیری ماشین ببینید.

نرخ مثبت کاذب (FPR)

#مبانی

#متریک

نسبت مثال‌های منفی واقعی که مدل به اشتباه کلاس مثبت را پیش‌بینی کرده است. فرمول زیر نرخ مثبت کاذب را محاسبه می کند:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

نرخ مثبت کاذب، محور x در منحنی ROC است.

برای اطلاعات بیشتر به طبقه بندی: ROC و AUC در دوره تصادف یادگیری ماشینی مراجعه کنید.

ویژگی

#مبانی

یک متغیر ورودی به یک مدل یادگیری ماشینی یک مثال از یک یا چند ویژگی تشکیل شده است. به عنوان مثال، فرض کنید در حال آموزش مدلی برای تعیین تأثیر شرایط آب و هوایی بر نمرات آزمون دانش آموزان هستید. جدول زیر سه نمونه را نشان می دهد که هر کدام شامل سه ویژگی و یک برچسب است:

ویژگی ها			برچسب بزنید
دما	رطوبت	فشار	نمره آزمون
15	47	998	92
19	34	1020	84
18	92	1012	87

کنتراست با برچسب

برای اطلاعات بیشتر، آموزش تحت نظارت را در دوره مقدماتی یادگیری ماشین ببینید.

متقاطع ویژگی

#مبانی

یک ویژگی مصنوعی که با "تقاطع" ویژگی های طبقه بندی شده یا سطلی شکل می گیرد.

به عنوان مثال، یک مدل "پیش بینی خلق و خو" را در نظر بگیرید که دما را در یکی از چهار سطل زیر نشان می دهد:

freezing
chilly
temperate
warm

و سرعت باد را در یکی از سه سطل زیر نشان می دهد:

still
light
windy

بدون تلاقی ویژگی ها، مدل خطی به طور مستقل در هر یک از هفت سطل مختلف قبلی تمرین می کند. بنابراین، این مدل به عنوان مثال، مستقل از آموزش، به عنوان مثال، در windy freezing تمرین می کند.

از طرف دیگر، می توانید یک تلاقی ویژگی از دما و سرعت باد ایجاد کنید. این ویژگی مصنوعی دارای 12 مقدار ممکن زیر است:

freezing-still
freezing-light
freezing-windy
chilly-still
chilly-light
chilly-windy
temperate-still
temperate-light
temperate-windy
warm-still
warm-light
warm-windy

به لطف ویژگی‌های ضربدری، این مدل می‌تواند تفاوت‌های خلقی را بین یک روز freezing-windy و یک روز freezing-still بیاموزد.

اگر یک ویژگی مصنوعی از دو ویژگی ایجاد کنید که هر کدام دارای سطل های مختلف هستند، ویژگی متقاطع حاصل تعداد زیادی ترکیب ممکن خواهد داشت. به عنوان مثال، اگر یک ویژگی دارای 1000 سطل و ویژگی دیگر دارای 2000 سطل باشد، متقاطع ویژگی حاصل دارای 2،000،000 سطل است.

به طور رسمی، صلیب یک محصول دکارتی است.

تلاقی ویژگی ها بیشتر با مدل های خطی استفاده می شود و به ندرت برای شبکه های عصبی استفاده می شود.

برای اطلاعات بیشتر، داده‌های دسته‌بندی: تلاقی ویژگی‌ها را در دوره تصادف یادگیری ماشینی ببینید.

مهندسی ویژگی

#مبانی

#TensorFlow

فرآیندی که شامل مراحل زیر است:

تعیین اینکه کدام ویژگی ممکن است در آموزش یک مدل مفید باشد.
تبدیل داده های خام از مجموعه داده به نسخه های کارآمد آن ویژگی ها.

برای مثال، ممکن است تعیین کنید که temperature ممکن است یک ویژگی مفید باشد. سپس، می‌توانید با سطل‌سازی آزمایش کنید تا آنچه را که مدل می‌تواند از محدوده‌های temperature مختلف بیاموزد، بهینه کنید.

مهندسی ویژگی گاهی اوقات استخراج ویژگی یا ویژگی نامیده می شود.

برای یادداشت های اضافی در مورد TensorFlow روی نماد کلیک کنید.

در TensorFlow ، مهندسی ویژگی اغلب به معنای تبدیل ورودی های پرونده ورود به سیستم خام به بافرهای پروتکل TF.example است. همچنین به tf.transform مراجعه کنید.

به داده های عددی مراجعه کنید: چگونه یک مدل برای اطلاعات بیشتر داده ها را با استفاده از بردارهای ویژگی در دوره Crash Learning Machine می گذارد .

مجموعه ویژگی

#فونداستال ها

گروه از ویژگی های مدل یادگیری ماشین شما در آن قرار دارد. به عنوان مثال ، یک ویژگی ساده برای مدلی که قیمت مسکن را پیش بینی می کند ممکن است از کد پستی ، اندازه خاصیت و شرایط خاصیت تشکیل شود.

بردار ویژگی

#فونداستال ها

مجموعه مقادیر ویژگی شامل یک مثال . بردار ویژگی در حین آموزش و در حین استنباط ورودی است. به عنوان مثال ، بردار ویژگی برای یک مدل با دو ویژگی گسسته ممکن است:

[0.92, 0.56]

چهار لایه: یک لایه ورودی ، دو لایه پنهان و یک لایه خروجی. لایه ورودی شامل دو گره است ، یکی حاوی مقدار 0.92 و دیگری حاوی مقدار 0.56.

هر مثال مقادیر مختلفی را برای بردار ویژگی فراهم می کند ، بنابراین بردار ویژگی برای مثال بعدی می تواند چیزی شبیه باشد:

[0.73, 0.49]

مهندسی ویژگی نحوه نمایش ویژگی ها در بردار ویژگی را تعیین می کند. به عنوان مثال ، یک ویژگی طبقه بندی باینری با پنج مقدار ممکن ممکن است با رمزگذاری یک داغ نشان داده شود. در این حالت ، بخشی از بردار ویژگی برای یک مثال خاص شامل چهار صفر و یک موقعیت 1.0 در موقعیت سوم است ، به شرح زیر:

[0.0, 0.0, 1.0, 0.0, 0.0]

به عنوان نمونه دیگر ، فرض کنید مدل شما از سه ویژگی تشکیل شده است:

یک ویژگی طبقه بندی باینری با پنج مقدار ممکن که با رمزگذاری یک داغ نشان داده شده است. به عنوان مثال: [0.0, 1.0, 0.0, 0.0, 0.0]
یکی دیگر از ویژگی های طبقه بندی باینری با سه مقدار ممکن که با رمزگذاری یک داغ نشان داده شده است. به عنوان مثال: [0.0, 0.0, 1.0]
یک ویژگی نقطه شناور ؛ به عنوان مثال: 8.3 .

در این حالت ، بردار ویژگی برای هر مثال توسط نه مقدار نشان داده می شود. با توجه به مقادیر مثال در لیست قبلی ، بردار ویژگی:

0.0
1.0
0.0
0.0
0.0
0.0
0.0
1.0
8.3

حلقه بازخورد

#فونداستال ها

در یادگیری ماشین ، وضعیتی که پیش بینی های یک مدل بر داده های آموزش برای همان مدل یا مدل دیگر تأثیر می گذارد. به عنوان مثال ، مدلی که فیلم ها را توصیه می کند ، فیلم هایی را که افراد می بینند تأثیر می گذارد ، که در این صورت بر مدل های توصیه فیلم بعدی تأثیر می گذارد.

برای اطلاعات بیشتر به سیستم ML Systems: سؤالاتی که باید در دوره تصادف یادگیری ماشین بپرسید ، مراجعه کنید.

جی

تعمیم

#فونداستال ها

توانایی یک مدل در پیش بینی های صحیح در مورد داده های جدید و قبلاً دیده نشده است. مدلی که می تواند تعمیم دهد ، برعکس مدلی است که بیش از حد مناسب است.

برای یادداشت های اضافی روی نماد کلیک کنید.

شما یک مدل را بر روی مثالهای موجود در مجموعه آموزش آموزش می دهید. در نتیجه ، مدل خصوصیات داده های موجود در مجموعه آموزش را می آموزد. تعمیم در اصل می پرسد که آیا مدل شما می تواند پیش بینی های خوبی را در نمونه هایی که در مجموعه آموزش نیستند ، پیش بینی کند.

برای تشویق تعمیم ، منظم سازی به یک مدل کمک می کند تا دقیقاً به خصوصیات داده های موجود در مجموعه آموزش آموزش دهد.

برای کسب اطلاعات بیشتر به تعمیم در دوره Crash Learning Machine مراجعه کنید.

منحنی تعمیم

#فونداستال ها

یک طرح از دست دادن آموزش و از دست دادن اعتبارسنجی به عنوان تابعی از تعداد تکرارها .

یک منحنی تعمیم می تواند به شما در تشخیص بیش از حد احتمالی کمک کند. به عنوان مثال ، منحنی عمومی سازی زیر حاکی از افزایش بیش از حد است زیرا در نهایت از دست دادن اعتبار سنجی به طور قابل توجهی بالاتر از دست دادن آموزش است.

نمودار دکارتی که در آن محور y دارای برچسب از دست دادن و محور x است با عنوان تکرارهای برچسب خورده است. دو قطعه ظاهر می شوند. یک قطعه نشان می دهد از دست دادن آموزش و دیگری ضرر اعتبار سنجی را نشان می دهد. این دو قطعه به طور مشابه شروع می شوند ، اما در نهایت از دست دادن تمرین افتادگی به مراتب پایین تر از ضرر اعتبار سنجی.

برای کسب اطلاعات بیشتر به تعمیم در دوره Crash Learning Machine مراجعه کنید.

شیب نزول

#فونداستال ها

یک تکنیک ریاضی برای به حداقل رساندن ضرر . نزول شیب به طور تکراری وزن و تعصب را تنظیم می کند ، به تدریج بهترین ترکیب را برای به حداقل رساندن از دست دادن پیدا می کند.

نزول شیب قدیمی تر از یادگیری ماشین بسیار قدیمی تر است.

برای کسب اطلاعات بیشتر به رگرسیون خطی: نزول شیب در دوره تصادف Learning Machine مراجعه کنید.

حقیقت زمین

#فونداستال ها

واقعیت.

اتفاقی که در واقع اتفاق افتاد.

به عنوان مثال ، یک مدل طبقه بندی باینری را در نظر بگیرید که پیش بینی می کند که آیا دانشجویی در سال اول دانشگاه خود طی شش سال فارغ التحصیل خواهد شد. حقیقت زمینی برای این مدل این است که آیا دانش آموز در واقع طی شش سال فارغ التحصیل شده است یا خیر.

برای یادداشت های اضافی روی نماد کلیک کنید.

ما کیفیت مدل را در برابر حقیقت زمین ارزیابی می کنیم. با این حال ، حقیقت زمین همیشه کاملاً ، خوب ، راستگو نیست. به عنوان مثال ، نمونه های زیر از نواقص بالقوه در حقیقت را در نظر بگیرید:

در مثال فارغ التحصیلی ، آیا ما مطمئن هستیم که سوابق فارغ التحصیلی برای هر دانش آموز همیشه صحیح است؟ آیا سوابق دانشگاه بی عیب و نقص است؟
فرض کنید برچسب یک مقدار نقطه شناور است که توسط ابزارها (به عنوان مثال ، فشارسنج) اندازه گیری می شود. چگونه می توانیم مطمئن باشیم که هر ساز به طور یکسان کالیبره می شود یا اینکه هر خواندن در همان شرایط گرفته شده است؟
اگر این برچسب یک موضوع از نظر انسان است ، چگونه می توانیم مطمئن باشیم که هر یک از انسان ها وقایع را به همان روش ارزیابی می کنند؟ برای بهبود قوام ، رأی دهندگان متخصص انسان گاهی مداخله می کنند.

اچ

لایه پنهان

#فونداستال ها

یک لایه در یک شبکه عصبی بین لایه ورودی (ویژگی ها) و لایه خروجی (پیش بینی). هر لایه پنهان از یک یا چند نورون تشکیل شده است. به عنوان مثال ، شبکه عصبی زیر شامل دو لایه پنهان است ، اول با سه نورون و دوم با دو نورون:

یک شبکه عصبی عمیق حاوی بیش از یک لایه پنهان است. به عنوان مثال ، تصویر قبلی یک شبکه عصبی عمیق است زیرا این مدل حاوی دو لایه پنهان است.

برای اطلاعات بیشتر به شبکه های عصبی: گره ها و لایه های پنهان در دوره Crash Learning Machine مراجعه کنید.

هایپرپارامتر

#فونداستال ها

متغیرهایی که شما یا یک سرویس تنظیم Hyperparameter هستیددر طول دوره های پی در پی آموزش یک مدل تنظیم کنید. به عنوان مثال ، میزان یادگیری یک هیپرپارامتر است. می توانید قبل از یک جلسه آموزشی ، نرخ یادگیری را روی 0.01 تنظیم کنید. اگر تعیین کنید که 0.01 خیلی زیاد است ، شاید می توانید نرخ یادگیری را برای جلسه آموزشی بعدی 0.003 تعیین کنید.

در مقابل ، پارامترها وزن و تعصب مختلفی هستند که مدل در طول آموزش می آموزد .

برای اطلاعات بیشتر به رگرسیون خطی مراجعه کنید: HyperParameters در دوره Crash Learning Machine.

من

به طور مستقل و یکسان توزیع شده (IID)

#فونداستال ها

داده های حاصل از توزیع که تغییر نمی کند ، و جایی که هر مقدار ترسیم شده به مقادیری که قبلاً ترسیم شده اند بستگی ندارد. IID گاز ایده آل یادگیری ماشین است - یک ساختار ریاضی مفید اما تقریباً هرگز در دنیای واقعی یافت نمی شود. به عنوان مثال ، توزیع بازدید کنندگان به یک صفحه وب ممکن است در یک پنجره کوتاه از زمان باشد. یعنی توزیع در طی آن پنجره مختصر تغییر نمی کند و بازدید یک نفر به طور کلی مستقل از بازدید شخص دیگر است. با این حال ، اگر آن پنجره زمان را گسترش دهید ، ممکن است تفاوت های فصلی در بازدید کنندگان صفحه وب ظاهر شود.

همچنین به غیر استیجت مراجعه کنید.

استنتاج

#فونداستال ها

در یادگیری ماشین ، فرایند پیش بینی با استفاده از یک مدل آموزش دیده در نمونه های بدون برچسب .

استنتاج در آمار معنای کمی متفاوت دارد. برای جزئیات بیشتر به مقاله ویکی پدیا در مورد استنتاج آماری مراجعه کنید.

برای دیدن نقش استنتاج در یک سیستم یادگیری تحت نظارت ، یادگیری تحت نظارت را در دوره معرفی به ML مشاهده کنید.

لایه ورودی

#فونداستال ها

لایه یک شبکه عصبی که دارای بردار ویژگی است. یعنی لایه ورودی نمونه هایی را برای آموزش یا استنباط ارائه می دهد. به عنوان مثال ، لایه ورودی در شبکه عصبی زیر از دو ویژگی تشکیل شده است:

چهار لایه: یک لایه ورودی ، دو لایه پنهان و یک لایه خروجی.

تفسیر پذیری

#فونداستال ها

توانایی توضیح یا ارائه استدلال یک مدل ML به صورت قابل درک به یک انسان.

به عنوان مثال ، بیشتر مدل های رگرسیون خطی بسیار قابل تفسیر هستند. (شما فقط باید به وزن های آموزش دیده برای هر ویژگی نگاه کنید.) جنگل های تصمیم گیری نیز بسیار قابل تفسیر هستند. با این حال ، برخی از مدل ها برای تفسیر قابل تفسیر نیاز به تجسم پیچیده دارند.

برای تفسیر مدل های ML می توانید از ابزار تفسیر یادگیری (LIT) استفاده کنید.

تکرار

#فونداستال ها

یک به روزرسانی واحد از پارامترهای یک مدل - وزن و تعصب مدل - آموزش . اندازه دسته ای تعیین می کند که چند نمونه از مدل در یک تکرار واحد فرآیند می کند. به عنوان مثال ، اگر اندازه دسته ای 20 باشد ، مدل قبل از تنظیم پارامترها 20 نمونه را پردازش می کند.

هنگام آموزش یک شبکه عصبی ، یک تکرار واحد شامل دو پاس زیر است:

یک پاس رو به جلو برای ارزیابی ضرر در یک دسته واحد.
یک پاس به عقب ( backpropagation ) برای تنظیم پارامترهای مدل بر اساس ضرر و میزان یادگیری.

برای کسب اطلاعات بیشتر به شیب نزول در دوره Crash Learning Machine مراجعه کنید.

L

تنظیم _منظم

#فونداستال ها

نوعی منظم سازی که تعداد کل وزن های غیرزرو را در یک مدل مجازات می کند. به عنوان مثال ، یک مدل با 11 وزن غیرزرو بیش از یک مدل مشابه با 10 وزن غیرزرو مجازات می شود.

تنظیم مجدد L ₀ گاهی اوقات تنظیم L0-NORM نامیده می شود.

برای یادداشت های اضافی روی نماد کلیک کنید.

L ₀ تنظیم مجدد در مدلهای بزرگ غیر عملی است زیرا تنظیم مجدد L ₀ آموزش را به یک مشکل بهینه سازی محدب تبدیل می کند.

L ₁ ضرر

#فونداستال ها

#متناقض

یک تابع از دست دادن که مقدار مطلق تفاوت بین مقادیر برچسب واقعی و مقادیری را که یک مدل پیش بینی می کند محاسبه می کند. به عنوان مثال ، در اینجا محاسبه از دست دادن L ₁ برای دسته ای از پنج مثال آورده شده است:

مقدار واقعی مثال	مقدار پیش بینی شده مدل	مقدار مطلق دلتا
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L ₁ ضرر

L ₁ از دست دادن نسبت به Outliers نسبت به L ₂ از دست دادن حساسیت کمتری دارد.

میانگین خطای مطلق میانگین ضرر L ₁ در هر مثال است.

برای دیدن ریاضی رسمی ، روی نماد کلیک کنید.

$$ L_1 loss = \sum_{i=0}^n | y_i - \hat{y}_i |$$

کجا:

$ n $ تعداد نمونه ها است.
$ y $ ارزش واقعی برچسب است.
$ \ hat {y} $ مقداری است که مدل برای $ y $ پیش بینی می کند.

برای کسب اطلاعات بیشتر به رگرسیون خطی مراجعه کنید: از دست دادن در دوره تصادف یادگیری ماشین.

l ₁ منظم سازی

#فونداستال ها

نوعی منظم سازی که وزن ها را متناسب با مجموع مقدار مطلق وزنها مجازات می کند. تنظیم منظم L ₁ به هدایت وزن ویژگی های بی ربط یا به سختی مرتبط با دقیقاً 0 کمک می کند. یک ویژگی با وزن 0 به طور موثری از مدل حذف می شود.

کنتراست با تنظیم مجدد L ₂ .

L ₂ ضرر

#فونداستال ها

#متناقض

یک تابع از دست دادن که مربع تفاوت بین مقادیر برچسب واقعی و مقادیری را که یک مدل پیش بینی می کند محاسبه می کند. به عنوان مثال ، در اینجا محاسبه از دست دادن L ₂ برای دسته ای از پنج مثال آورده شده است:

مقدار واقعی مثال	مقدار پیش بینی شده مدل	مربع دلتا
7	6	1
5	4	1
8	11	9
4	6	4
9	8	1
		16 = L ₂ ضرر

با توجه به مربع ، L ₂ ضرر تأثیر دور را تقویت می کند. یعنی L ₂ ضرر نسبت به از دست دادن L ₁ نسبت به پیش بینی های بد واکنش نشان می دهد. به عنوان مثال ، ضرر L ₁ برای دسته قبلی 8 خواهد بود و نه 16.

مدل های رگرسیون به طور معمول از L ₂ از دست دادن به عنوان عملکرد از دست دادن استفاده می کنند.

میانگین خطای مربع میانگین از دست دادن L ₂ در هر مثال است. از دست دادن مربع نام دیگری برای از دست دادن L ₂ است.

برای دیدن ریاضی رسمی ، روی نماد کلیک کنید.

$$ L_2 loss = \sum_{i=0}^n {(y_i - \hat{y}_i)}^2$$

کجا:

$ n $ تعداد نمونه ها است.
$ y $ ارزش واقعی برچسب است.
$ \ hat {y} $ مقداری است که مدل برای $ y $ پیش بینی می کند.

برای کسب اطلاعات بیشتر به رگرسیون لجستیک: از دست دادن و تنظیم در دوره تصادف یادگیری ماشین مراجعه کنید.

تنظیم منظم L ₂

#فونداستال ها

نوعی منظم سازی که وزن را متناسب با مجموع مربع وزن ها مجازات می کند. تنظیم مجدد L ₂ به رانندگی وزنهای دورتر (کسانی که دارای مقادیر منفی مثبت یا پایین هستند) به 0 نزدیک می شود اما کاملاً به 0 نیست . ویژگی هایی با مقادیر بسیار نزدیک به 0 در مدل باقی می مانند اما پیش بینی مدل را بسیار تحت تأثیر قرار نمی دهد.

تنظیم مجدد L ₂ همیشه تعمیم در مدلهای خطی را بهبود می بخشد.

کنتراست با تنظیم مجدد L ₁ .

برای کسب اطلاعات بیشتر به Overfittion مراجعه کنید: تنظیم مجدد L2 در دوره Crash Learning Machine.

برچسب

#فونداستال ها

در یادگیری ماشین تحت نظارت ، بخش "پاسخ" یا "نتیجه" از یک مثال .

هر مثال برچسب شامل یک یا چند ویژگی و یک برچسب است. به عنوان مثال ، در یک مجموعه داده تشخیص هرزنامه ، برچسب احتمالاً یا "هرزنامه" یا "نه هرزنامه" خواهد بود. در یک مجموعه داده بارندگی ، این برچسب ممکن است میزان باران باشد که در طی یک دوره خاص کاهش یافته است.

برای اطلاعات بیشتر به یادگیری نظارت شده در مقدمه یادگیری ماشین مراجعه کنید.

نمونه

#فونداستال ها

نمونه ای که شامل یک یا چند ویژگی و یک برچسب است. به عنوان مثال ، جدول زیر سه نمونه برچسب زده شده از یک مدل ارزیابی خانه را نشان می دهد که هر کدام دارای سه ویژگی و یک برچسب است:

تعداد اتاق خواب	تعداد حمام ها	دوره خانه	قیمت خانه (برچسب)
3	2	15	345000 دلار
2	1	72	179000 دلار
4	2	34	392000 دلار

در یادگیری ماشین تحت نظارت ، مدل ها بر روی نمونه های دارای برچسب آموزش می بینند و در مورد نمونه های بدون برچسب پیش بینی می کنند.

نمونه کنتراست با نمونه های بدون برچسب.

برای اطلاعات بیشتر به یادگیری نظارت شده در مقدمه یادگیری ماشین مراجعه کنید.

لامبدا

#فونداستال ها

مترادف برای نرخ منظم .

لامبدا یک اصطلاح بیش از حد است. در اینجا ما روی تعریف این اصطلاح در تنظیم مجدد تمرکز می کنیم.

لایه

#فونداستال ها

مجموعه ای از نورون ها در یک شبکه عصبی . سه نوع لایه مشترک به شرح زیر است:

لایه ورودی ، که مقادیر همه ویژگی ها را فراهم می کند.
یک یا چند لایه پنهان ، که روابط غیرخطی بین ویژگی ها و برچسب پیدا می کنند.
لایه خروجی ، که پیش بینی را ارائه می دهد.

به عنوان مثال ، تصویر زیر یک شبکه عصبی با یک لایه ورودی ، دو لایه پنهان و یک لایه خروجی را نشان می دهد:

یک شبکه عصبی با یک لایه ورودی ، دو لایه پنهان و یک لایه خروجی لایه ورودی از دو ویژگی تشکیل شده است. اولین لایه پنهان از سه نورون و لایه دوم پنهان تشکیل شده است از دو نورون تشکیل شده است. لایه خروجی از یک گره واحد تشکیل شده است.

در TensorFlow ، لایه ها نیز توابع پایتون هستند که تانسور و گزینه های پیکربندی را به عنوان ورودی می گیرند و تنش های دیگر را به عنوان خروجی تولید می کنند.

میزان یادگیری

#فونداستال ها

یک شماره نقطه شناور که به الگوریتم نزول شیب می گوید چگونه می توان وزن و تعصب را در هر تکرار تنظیم کرد. به عنوان مثال ، میزان یادگیری 0.3 می تواند وزن و تعصب را سه برابر قدرتمندتر از نرخ یادگیری 0.1 تنظیم کند.

میزان یادگیری یک هیپرپارامتر کلیدی است. اگر نرخ یادگیری را خیلی پایین تنظیم کنید ، آموزش بیش از حد طول می کشد. اگر نرخ یادگیری را خیلی زیاد تنظیم کنید ، نزول شیب اغلب در رسیدن به همگرایی مشکل دارد.

برای توضیحات ریاضی بیشتر روی نماد کلیک کنید.

در طول هر تکرار ، الگوریتم نزول شیب نرخ یادگیری را توسط شیب چند برابر می کند. محصول حاصل از مرحله شیب نامیده می شود.

برای اطلاعات بیشتر به رگرسیون خطی مراجعه کنید: HyperParameters در دوره Crash Learning Machine.

خطی

#فونداستال ها

رابطه بین دو یا چند متغیر که فقط از طریق افزودن و ضرب قابل نمایش هستند.

طرح یک رابطه خطی یک خط است.

تضاد با غیرخطی .

مدل خطی

#فونداستال ها

مدلی که یک وزن در هر ویژگی را برای پیش بینی تعیین می کند. (مدل های خطی نیز دارای تعصب هستند.) در مقابل ، رابطه ویژگی ها با پیش بینی در مدل های عمیق به طور کلی غیرخطی است.

مدل های خطی معمولاً آموزش آسانتر و قابل تفسیر از مدل های عمیق هستند. با این حال ، مدل های عمیق می توانند روابط پیچیده ای بین ویژگی ها بیاموزند.

رگرسیون خطی و رگرسیون لجستیک دو نوع مدل خطی است.

برای دیدن ریاضی روی نماد کلیک کنید.

یک مدل خطی از این فرمول پیروی می کند:

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

کجا:

y 'پیش بینی خام است. (در انواع خاصی از مدل های خطی ، این پیش بینی خام بیشتر اصلاح می شود. به عنوان مثال ، رگرسیون لجستیک را ببینید.)
B تعصب است.
W یک وزن است ، بنابراین W ₁ وزن اولین ویژگی ، W ₂ وزن ویژگی دوم و غیره است.
X یک ویژگی است ، بنابراین x ₁ مقدار ویژگی اول است ، x ₂ مقدار ویژگی دوم و غیره است.

به عنوان مثال ، فرض کنید یک مدل خطی برای سه ویژگی تعصب و وزن زیر را می آموزد:

b = 7
W ₁ = -2.5
W ₂ = -1.2
W ₃ = 1.4

بنابراین ، با توجه به سه ویژگی (x ₁ ، x ₂ و x ₃ ) ، مدل خطی از معادله زیر برای تولید هر پیش بینی استفاده می کند:

y' = 7 + (-2.5)(x₁) + (-1.2)(x₂) + (1.4)(x₃)

فرض کنید یک مثال خاص حاوی مقادیر زیر است:

x ₁ = 4
x ₂ = -10
x ₃ = 5

وصل کردن این مقادیر به فرمول پیش بینی این مثال را به همراه دارد:

y' = 7 + (-2.5)(4) + (-1.2)(-10) + (1.4)(5)
y' = 16

مدل های خطی نه تنها شامل مدل هایی است که فقط از یک معادله خطی برای پیش بینی استفاده می کنند بلکه مجموعه گسترده تری از مدل ها را نیز از یک معادله خطی به عنوان تنها یک مؤلفه فرمول استفاده می کنند که پیش بینی می کند. به عنوان مثال ، رگرسیون لجستیک پس از فرآیند پیش بینی خام (y)) برای تولید یک مقدار پیش بینی نهایی بین 0 تا 1 ، منحصراً.

رگرسیون خطی

#فونداستال ها

نوعی مدل یادگیری ماشین که در آن هر دو مورد صحیح است:

مدل یک مدل خطی است.
پیش بینی یک مقدار نقطه شناور است. (این بخش رگرسیون رگرسیون خطی است.)

رگرسیون خطی کنتراست با رگرسیون لجستیک . همچنین ، رگرسیون کنتراست با طبقه بندی .

برای کسب اطلاعات بیشتر به رگرسیون خطی در دوره Crash Learning Machine مراجعه کنید.

رگرسیون لجستیک

#فونداستال ها

نوعی مدل رگرسیون که یک احتمال را پیش بینی می کند. مدل های رگرسیون لجستیک ویژگی های زیر را دارند:

برچسب طبقه بندی شده است. اصطلاح رگرسیون لجستیک معمولاً به رگرسیون لجستیک باینری اشاره دارد ، یعنی به مدلی که احتمال را برای برچسب ها با دو مقدار ممکن محاسبه می کند. یک نوع کمتر متداول ، رگرسیون لجستیک چندمجمی ، احتمالات مربوط به برچسب ها را با بیش از دو مقدار ممکن محاسبه می کند.
عملکرد ضرر در طول آموزش از دست دادن ورود به سیستم است. (چند واحد از دست دادن ورود به سیستم را می توان به طور موازی برای برچسب هایی با بیش از دو مقدار ممکن قرار داد.)
این مدل دارای معماری خطی است ، نه یک شبکه عصبی عمیق. با این حال ، باقیمانده این تعریف همچنین در مورد مدل های عمیق که احتمال برچسب های طبقه بندی را پیش بینی می کند ، اعمال می شود.

به عنوان مثال ، یک مدل رگرسیون لجستیک را در نظر بگیرید که احتمال یک ایمیل ورودی یا هرزنامه یا اسپم را محاسبه می کند. در طول استنتاج ، فرض کنید مدل 0.72 را پیش بینی می کند. بنابراین ، مدل تخمین می زند:

72 ٪ شانس ایمیل در هرزنامه.
28 ٪ احتمال عدم اسپم ایمیل.

یک مدل رگرسیون لجستیک از معماری دو مرحله ای زیر استفاده می کند:

این مدل با استفاده از یک تابع خطی از ویژگی های ورودی ، پیش بینی خام (y ') ایجاد می کند.
این مدل از پیش بینی خام به عنوان ورودی به یک عملکرد سیگموئید استفاده می کند ، که پیش بینی خام را به یک مقدار بین 0 تا 1 تبدیل می کند ، منحصر به فرد.

مانند هر مدل رگرسیون ، یک مدل رگرسیون لجستیک تعدادی را پیش بینی می کند. با این حال ، این تعداد به طور معمول بخشی از یک مدل طبقه بندی باینری به شرح زیر می شود:

اگر تعداد پیش بینی شده از آستانه طبقه بندی بیشتر باشد ، مدل طبقه بندی باینری کلاس مثبت را پیش بینی می کند.
اگر تعداد پیش بینی شده کمتر از آستانه طبقه بندی باشد ، مدل طبقه بندی باینری کلاس منفی را پیش بینی می کند.

برای کسب اطلاعات بیشتر به رگرسیون لجستیک در دوره Crash Learning Machine مراجعه کنید.

از دست دادن گزارش

#فونداستال ها

عملکرد از دست دادن مورد استفاده در رگرسیون لجستیک باینری.

برای دیدن ریاضی روی نماد کلیک کنید.

فرمول زیر از دست دادن ورود به سیستم را محاسبه می کند:

$$\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')$$

کجا:

$(x,y)\in D$ آیا مجموعه داده حاوی بسیاری از نمونه های دارای برچسب است که عبارتند از $(x,y)$ جفت
$y$ برچسب در یک مثال برچسب است. از آنجا که این رگرسیون لجستیک است ، هر مقدار از $y$ باید یا 0 یا 1 باشد.
$y'$ مقدار پیش بینی شده (در جایی بین 0 تا 1 ، اختصاصی) با توجه به مجموعه ویژگی های موجود در $x$.

برای کسب اطلاعات بیشتر به رگرسیون لجستیک: از دست دادن و تنظیم در دوره تصادف یادگیری ماشین مراجعه کنید.

با ورود به سیستم

#فونداستال ها

لگاریتم شانس برخی از رویدادها.

برای دیدن ریاضی روی نماد کلیک کنید.

اگر این رویداد یک احتمال باینری باشد ، شانس به نسبت احتمال موفقیت ( P ) به احتمال عدم موفقیت (1- P ) اشاره دارد. به عنوان مثال ، فرض کنید که یک رویداد خاص دارای 90 ٪ احتمال موفقیت و 10 ٪ احتمال شکست است. در این حالت ، شانس به شرح زیر محاسبه می شود:

$$ {\text{odds}} = \frac{\text{p}} {\text{(1-p)}} = \frac{.9} {.1} = {\text{9}} $$

ورود به سیستم به سادگی لگاریتم شانس است. طبق کنوانسیون ، "لگاریتم" به لگاریتم طبیعی اشاره دارد ، اما لگاریتم در واقع می تواند هر پایه ای بیشتر از 1. باشد. با چسبیدن به کنوانسیون ، ورود به سیستم نمونه ما از این رو است: بنابراین:

$$ {\text{log-odds}} = ln(9) ~= 2.2 $$

عملکرد log-odds معکوس عملکرد سیگموئید است.

از دست دادن

#فونداستال ها

#متناقض

در طول آموزش یک مدل نظارت شده ، معیاری از پیش بینی مدل تا چه اندازه از برچسب آن است.

یک تابع از دست دادن ضرر را محاسبه می کند.

برای کسب اطلاعات بیشتر به رگرسیون خطی مراجعه کنید: از دست دادن در دوره تصادف یادگیری ماشین.

منحنی ضرر

#فونداستال ها

طرح از دست دادن به عنوان تابعی از تعداد تکرارهای آموزش. طرح زیر یک منحنی ضرر معمولی را نشان می دهد:

نمودار دکارتی از دست دادن در مقابل تکرارهای آموزش ، نشان دادن یک افت سریع از دست دادن برای تکرارهای اولیه ، و به دنبال آن تدریجی در طول تکرارهای نهایی ، قطره و سپس یک شیب مسطح.

منحنی های از دست دادن می توانند به شما در تعیین زمان همگرا یا بیش از حد مدل کمک کنند.

منحنی های ضرر می توانند تمام انواع زیر را از دست بدهند:

از دست دادن آموزش
از دست دادن اعتبار سنجی
از دست دادن

همچنین به منحنی تعمیم مراجعه کنید.

برای کسب اطلاعات بیشتر ، به بیش از حد مراجعه کنید: تفسیر منحنی های ضرر در دوره تصادف Learning Machine.

عملکرد از دست دادن

#فونداستال ها

#متناقض

در حین آموزش یا آزمایش ، یک عملکرد ریاضی که از بین رفتن در یک دسته از نمونه ها محاسبه می کند. یک عملکرد از دست دادن باعث از بین رفتن کمتر برای مدل هایی می شود که پیش بینی های خوبی را نسبت به مدل هایی که پیش بینی های بدی دارند ، ایجاد می کند.

هدف از آموزش به طور معمول به حداقل رساندن ضرر است که عملکرد ضرر باز می گردد.

انواع مختلفی از توابع ضرر وجود دارد. عملکرد ضرر مناسب را برای نوع مدلی که می سازید انتخاب کنید. به عنوان مثال:

L ₂ از دست دادن (یا میانگین خطای مربع ) عملکرد از دست دادن رگرسیون خطی است.
از دست دادن ورود به سیستم عملکرد ضرر برای رگرسیون لجستیک است.

م

یادگیری ماشینی

#فونداستال ها

برنامه یا سیستمی که یک مدل را از داده های ورودی آموزش می دهد . مدل آموزش دیده می تواند پیش بینی های مفیدی را از داده های جدید (هرگز دیده نشده) که از همان توزیع مشابه مورد استفاده برای آموزش مدل تهیه شده است ، پیش بینی کند.

یادگیری ماشین همچنین به زمینه تحصیلی مربوط به این برنامه ها یا سیستم ها اشاره دارد.

برای اطلاعات بیشتر به مقدمه دوره یادگیری ماشین مراجعه کنید.

طبقه اکثریت

#فونداستال ها

برچسب رایج تر در یک مجموعه داده با کلاس متعادل . به عنوان مثال ، با توجه به یک مجموعه داده حاوی 99 ٪ برچسب منفی و 1 ٪ برچسب های مثبت ، برچسب های منفی کلاس اکثریت هستند.

تضاد با کلاس اقلیت .

برای اطلاعات بیشتر به مجموعه داده ها مراجعه کنید: مجموعه داده های نامتعادل در دوره Crash Learning Machine.

مینی دسته

#فونداستال ها

یک زیر مجموعه کوچک و به طور تصادفی از یک دسته که در یک تکرار پردازش می شود. اندازه دسته ای از یک مینی دسته معمولاً بین 10 تا 1000 نمونه است.

به عنوان مثال ، فرض کنید کل مجموعه آموزش (دسته کامل) شامل 1000 نمونه است. علاوه بر این فرض کنید که اندازه دسته ای از هر مینی دسته را به 20 تنظیم کرده اید. بنابراین ، هر تکرار از دست دادن در 20 نمونه از 1000 نمونه را تعیین می کند و سپس وزن و تعصب را بر این اساس تنظیم می کند.

محاسبه ضرر در یک مینی دسته بسیار کارآمدتر از ضرر در تمام نمونه های موجود در دسته کامل است.

برای اطلاعات بیشتر به رگرسیون خطی مراجعه کنید: HyperParameters در دوره Crash Learning Machine.

طبقه اقلیت

#فونداستال ها

برچسب کمتر متداول در یک مجموعه داده با کلاس متعادل . به عنوان مثال ، با توجه به یک مجموعه داده حاوی 99 ٪ برچسب منفی و 1 ٪ برچسب های مثبت ، برچسب های مثبت کلاس اقلیت هستند.

تضاد با کلاس اکثریت .

برای یادداشت های اضافی روی نماد کلیک کنید.

یک مجموعه آموزشی با یک میلیون نمونه چشمگیر به نظر می رسد. با این حال ، اگر طبقه اقلیت ضعیف باشد ، حتی ممکن است یک مجموعه آموزش بسیار بزرگ نیز کافی نباشد. کمتر روی تعداد کل نمونه ها در مجموعه داده ها و بیشتر روی تعداد مثالهای موجود در کلاس اقلیت تمرکز کنید.

اگر مجموعه داده شما حاوی نمونه های کلاس اقلیت کافی نیست ، برای تکمیل کلاس اقلیت از Downsampling (تعریف در گلوله دوم) استفاده کنید.

برای اطلاعات بیشتر به مجموعه داده ها مراجعه کنید: مجموعه داده های نامتعادل در دوره Crash Learning Machine.

مدل

#فونداستال ها

به طور کلی ، هر ساختاری ریاضی که داده های ورودی را پردازش می کند و خروجی را باز می گرداند. با بیان متفاوت ، یک مدل مجموعه پارامترها و ساختار مورد نیاز برای یک سیستم برای پیش بینی است. در یادگیری ماشین تحت نظارت ، یک مدل به عنوان ورودی مثال می زند و پیش بینی را به عنوان خروجی نشان می دهد. در یادگیری ماشین تحت نظارت ، مدل ها تا حدودی متفاوت هستند. به عنوان مثال:

یک مدل رگرسیون خطی شامل مجموعه ای از وزنه ها و تعصب است.
یک مدل شبکه عصبی شامل موارد زیر است:
- مجموعه ای از لایه های پنهان ، هر یک حاوی یک یا چند نورون .
- وزن و تعصب مرتبط با هر نورون.
یک مدل درخت تصمیم شامل موارد زیر است:
- شکل درخت ؛ یعنی الگویی که در آن شرایط و برگها به هم وصل شده است.
- شرایط و برگها.

می توانید از یک مدل ذخیره ، بازیابی یا تهیه کنید.

یادگیری دستگاه بدون نظارت همچنین مدل هایی را تولید می کند ، به طور معمول تابعی که می تواند یک نمونه ورودی را برای مناسب ترین خوشه ترسیم کند.

برای مقایسه توابع جبر و برنامه نویسی با مدل های ML ، روی نماد کلیک کنید.

یک عملکرد جبری مانند موارد زیر یک مدل است:

  f(x, y) = 3x -5xy + y² + 17

عملکرد قبلی مقادیر ورودی ( x و y ) را برای خروجی نقشه می کند.

به همین ترتیب ، یک عملکرد برنامه نویسی مانند موارد زیر نیز یک مدل است:

def half_of_greater(x, y):
  if (x > y):
    return(x / 2)
  else
    return(y / 2)

یک تماس گیرنده آرگومان ها را به عملکرد Python قبلی منتقل می کند ، و عملکرد پایتون خروجی (از طریق عبارت Return ) ایجاد می کند.

اگرچه یک شبکه عصبی عمیق از ساختار ریاضی بسیار متفاوتی نسبت به یک عملکرد جبری یا برنامه نویسی برخوردار است ، اما یک شبکه عصبی عمیق هنوز ورودی (یک مثال) را می گیرد و خروجی را برمی گرداند (پیش بینی).

یک برنامه نویس انسان یک عملکرد برنامه نویسی را به صورت دستی رمزگذاری می کند. در مقابل ، یک مدل یادگیری ماشین به تدریج پارامترهای بهینه را در طول آموزش خودکار می آموزد.

طبقه بندی چند طبقه

#فونداستال ها

در یادگیری تحت نظارت ، یک مشکل طبقه بندی که در آن مجموعه داده شامل بیش از دو کلاس برچسب است. به عنوان مثال ، برچسب های موجود در مجموعه داده های Iris باید یکی از سه کلاس زیر باشد:

زنبق ستوزا
زنبق ویرجینیکا
زنبق ورسیکالر

مدلی که در مجموعه داده های Iris آموزش داده شده است که نوع IRIS را در نمونه های جدید پیش بینی می کند ، انجام طبقه بندی چند کلاس است.

در مقابل ، مشکلات طبقه بندی که دقیقاً بین دو کلاس تمایز قائل هستند ، مدل های طبقه بندی باینری هستند. به عنوان مثال ، یک مدل ایمیل که اسپم را پیش بینی می کند یا نه هرزنامه یک مدل طبقه بندی باینری است.

در مشکلات خوشه بندی ، طبقه بندی چند طبقه به بیش از دو خوشه اشاره دارد.

برای کسب اطلاعات بیشتر به شبکه های عصبی: طبقه بندی چند طبقه در دوره تصادف یادگیری ماشین مراجعه کنید.

ن

طبقه منفی

#فونداستال ها

#متناقض

در طبقه بندی باینری ، یک کلاس مثبت خوانده می شود و دیگری منفی نامیده می شود. کلاس مثبت چیز یا رویدادی است که مدل در حال آزمایش است و کلاس منفی احتمال دیگر است. به عنوان مثال:

کلاس منفی در یک آزمایش پزشکی ممکن است "تومور" نباشد.
کلاس منفی در یک مدل طبقه بندی ایمیل ممکن است "اسپم" نباشد.

تضاد با کلاس مثبت .

شبکه عصبی

#فونداستال ها

یک مدل حاوی حداقل یک لایه پنهان . یک شبکه عصبی عمیق نوعی از شبکه عصبی است که حاوی بیش از یک لایه پنهان است. به عنوان مثال ، نمودار زیر یک شبکه عصبی عمیق حاوی دو لایه پنهان را نشان می دهد.

یک شبکه عصبی با یک لایه ورودی ، دو لایه پنهان و یک لایه خروجی

هر نورون در یک شبکه عصبی به تمام گره های لایه بعدی متصل می شود. به عنوان مثال ، در نمودار قبلی ، توجه کنید که هر یک از سه نورون در لایه اول پنهان به طور جداگانه به هر دو نورون در لایه پنهان دوم متصل می شوند.

شبکه های عصبی که بر روی رایانه ها اجرا می شوند ، گاهی اوقات شبکه های عصبی مصنوعی نامیده می شوند تا آنها را از شبکه های عصبی موجود در مغز و سایر سیستم های عصبی متمایز کنند.

برخی از شبکه های عصبی می توانند از روابط غیرخطی بسیار پیچیده بین ویژگی های مختلف و برچسب تقلید کنند.

همچنین به شبکه عصبی Convolutional و شبکه عصبی مکرر مراجعه کنید.

برای اطلاعات بیشتر به شبکه های عصبی در دوره Crash Learning Machine مراجعه کنید.

نورون

#فونداستال ها

در یادگیری ماشین ، یک واحد مجزا در یک لایه پنهان از یک شبکه عصبی . هر نورون عملکرد دو مرحله ای زیر را انجام می دهد:

مقدار وزنی مقادیر ورودی را ضرب شده توسط وزن مربوطه آنها محاسبه می کند.
مبلغ وزنی را به عنوان ورودی به یک عملکرد فعال سازی منتقل می کند.

یک نورون در اولین لایه پنهان ورودی های مقادیر ویژگی موجود در لایه ورودی را می پذیرد. یک نورون در هر لایه پنهان فراتر از اولین ، ورودی های نورون ها را در لایه پنهان قبلی می پذیرد. به عنوان مثال ، یک نورون در لایه پنهان دوم ورودی های نورون ها را در لایه اول پنهان می پذیرد.

تصویر زیر دو نورون و ورودی های آنها را برجسته می کند.

یک شبکه عصبی با یک لایه ورودی ، دو لایه پنهان و یک لایه خروجی دو نورون برجسته می شوند: یکی در حالت اول لایه پنهان و یکی در لایه دوم پنهان. برجسته نورون در اولین لایه پنهان ورودی از هر دو ویژگی را دریافت می کند در لایه ورودی نورون برجسته در لایه دوم پنهان ورودی های هر یک از سه نورون را در اولین پنهان دریافت می کند لایه.

یک نورون در یک شبکه عصبی از رفتار نورون ها در مغز و سایر قسمت های سیستم های عصبی تقلید می کند.

گره (شبکه عصبی)

#فونداستال ها

یک نورون در یک لایه پنهان .

برای اطلاعات بیشتر به شبکه های عصبی در دوره Crash Learning Machine مراجعه کنید.

غیر خطی

#فونداستال ها

رابطه بین دو یا چند متغیر که فقط از طریق افزودن و ضرب قابل نمایش نیستند. یک رابطه خطی را می توان به عنوان یک خط نشان داد. یک رابطه غیرخطی نمی تواند به عنوان یک خط ارائه شود. به عنوان مثال ، دو مدل را در نظر بگیرید که هر کدام یک ویژگی واحد را به یک برچسب واحد مرتبط می کنند. مدل در سمت چپ خطی است و مدل در سمت راست غیرخطی است:

دو قطعه یک طرح یک خط است ، بنابراین این یک رابطه خطی است. طرح دیگر منحنی است ، بنابراین این یک رابطه غیرخطی است.

به شبکه های عصبی مراجعه کنید: گره ها و لایه های پنهان در دوره Crash Learning Machine برای آزمایش با انواع مختلف عملکردهای غیرخطی.

غیر ایستاری

#فونداستال ها

ویژگی ای که مقادیر آن در یک یا چند بعد تغییر می کند ، معمولاً زمان. به عنوان مثال ، مثالهای زیر از عدم استحکام را در نظر بگیرید:

تعداد لباس های شنا که در یک فروشگاه خاص فروخته می شود با فصل متفاوت است.
مقدار میوه خاصی که در یک منطقه خاص برداشت می شود برای بیشتر سال صفر است اما برای مدت کوتاهی بزرگ است.
با توجه به تغییرات آب و هوایی ، میانگین دما سالانه در حال تغییر است.

تضاد با ثابت بودن .

عادی سازی

#فونداستال ها

Broadly speaking, the process of converting a variable's actual range of values into a standard range of values, such as:

-1 to +1
0 به 1
Z-scores (roughly, -3 to +3)

For example, suppose the actual range of values of a certain feature is 800 to 2,400. As part of feature engineering , you could normalize the actual values down to a standard range, such as -1 to +1.

Normalization is a common task in feature engineering . Models usually train faster (and produce better predictions) when every numerical feature in the feature vector has roughly the same range.

See also Z-score normalization .

See Numerical Data: Normalization in Machine Learning Crash Course for more information.

داده های عددی

#fundamentals

Features represented as integers or real-valued numbers. For example, a house valuation model would probably represent the size of a house (in square feet or square meters) as numerical data. Representing a feature as numerical data indicates that the feature's values have a mathematical relationship to the label. That is, the number of square meters in a house probably has some mathematical relationship to the value of the house.

Not all integer data should be represented as numerical data. For example, postal codes in some parts of the world are integers; however, integer postal codes shouldn't be represented as numerical data in models. That's because a postal code of 20000 is not twice (or half) as potent as a postal code of 10000. Furthermore, although different postal codes do correlate to different real estate values, we can't assume that real estate values at postal code 20000 are twice as valuable as real estate values at postal code 10000. Postal codes should be represented as categorical data instead.

Numerical features are sometimes called continuous features .

See Working with numerical data in Machine Learning Crash Course for more information.

O

آفلاین

#fundamentals

Synonym for static .

offline inference

#fundamentals

The process of a model generating a batch of predictions and then caching (saving) those predictions. Apps can then access the inferred prediction from the cache rather than rerunning the model.

For example, consider a model that generates local weather forecasts (predictions) once every four hours. After each model run, the system caches all the local weather forecasts. Weather apps retrieve the forecasts from the cache.

Offline inference is also called static inference .

Contrast with online inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

one-hot encoding

#fundamentals

Representing categorical data as a vector in which:

One element is set to 1.
All other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a certain categorical feature named Scandinavia has five possible values:

"Denmark"
"سوئد"
"Norway"
"Finland"
"Iceland"

One-hot encoding could represent each of the five values as follows:

کشور	بردار
"Denmark"	1	0	0	0	0
"سوئد"	0	1	0	0	0
"Norway"	0	0	1	0	0
"Finland"	0	0	0	1	0
"Iceland"	0	0	0	0	1

Thanks to one-hot encoding, a model can learn different connections based on each of the five countries.

Representing a feature as numerical data is an alternative to one-hot encoding. Unfortunately, representing the Scandinavian countries numerically is not a good choice. For example, consider the following numeric representation:

"Denmark" is 0
"Sweden" is 1
"Norway" is 2
"Finland" is 3
"Iceland" is 4

With numeric encoding, a model would interpret the raw numbers mathematically and would try to train on those numbers. However, Iceland isn't actually twice as much (or half as much) of something as Norway, so the model would come to some strange conclusions.

See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.

one-vs.-all

#fundamentals

Given a classification problem with N classes, a solution consisting of N separate binary classifiers —one binary classifier for each possible outcome. For example, given a model that classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary classifiers:

animal versus not animal
vegetable versus not vegetable
mineral versus not mineral

آنلاین

#fundamentals

Synonym for dynamic .

online inference

#fundamentals

Generating predictions on demand. For example, suppose an app passes input to a model and issues a request for a prediction. A system using online inference responds to the request by running the model (and returning the prediction to the app).

Contrast with offline inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

output layer

#fundamentals

The "final" layer of a neural network. The output layer contains the prediction.

The following illustration shows a small deep neural network with an input layer, two hidden layers, and an output layer:

A neural network with one input layer, two hidden layers, and one لایه خروجی The input layer consists of two features. اولین hidden layer consists of three neurons and the second hidden layer consists of two neurons. The output layer consists of a single node.

بیش از حد

#fundamentals

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

Regularization can reduce overfitting. Training on a large and diverse training set can also reduce overfitting.

Click the icon for additional notes.

Overfitting is like strictly following advice from only your favorite teacher. You'll probably be successful in that teacher's class, but you might "overfit" to that teacher's ideas and be unsuccessful in other classes. Following advice from a mixture of teachers will enable you to adapt better to new situations.

See Overfitting in Machine Learning Crash Course for more information.

پ

پانداها

#fundamentals

A column-oriented data analysis API built on top of numpy . Many machine learning frameworks, including TensorFlow, support pandas data structures as inputs. See the pandas documentation for details.

پارامتر

#fundamentals

The weights and biases that a model learns during training . For example, in a linear regression model, the parameters consist of the bias ( b ) and all the weights ( w ₁ , w ₂ , and so on) in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

In contrast, hyperparameters are the values that you (or a hyperparameter tuning service) supply to the model. For example, learning rate is a hyperparameter.

positive class

#fundamentals

#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class .

Click the icon for additional notes.

The term positive class can be confusing because the "positive" outcome of many tests is often an undesirable result. For example, the positive class in many medical tests corresponds to tumors or diseases. In general, you want a doctor to tell you, "Congratulations! Your test results were negative." Regardless, the positive class is the event that the test is seeking to find.

Admittedly, you're simultaneously testing for both the positive and negative classes.

پس پردازش

#responsible

#fundamentals

Adjusting the output of a model after the model has been run. Post-processing can be used to enforce fairness constraints without modifying models themselves.

For example, one might apply post-processing to a binary classifier by setting a classification threshold such that equality of opportunity is maintained for some attribute by checking that the true positive rate is the same for all values of that attribute.

پیش بینی

#fundamentals

A model's output. به عنوان مثال:

The prediction of a binary classification model is either the positive class or the negative class.
The prediction of a multi-class classification model is one class.
The prediction of a linear regression model is a number.

proxy labels

#fundamentals

Data used to approximate labels not directly available in a dataset.

For example, suppose you must train a model to predict employee stress level. Your dataset contains a lot of predictive features but doesn't contain a label named stress level. Undaunted, you pick "workplace accidents" as a proxy label for stress level. After all, employees under high stress get into more accidents than calm employees. یا آنها؟ Maybe workplace accidents actually rise and fall for multiple reasons.

As a second example, suppose you want is it raining? to be a Boolean label for your dataset, but your dataset doesn't contain rain data. If photographs are available, you might establish pictures of people carrying umbrellas as a proxy label for is it raining? Is that a good proxy label? Possibly, but people in some cultures may be more likely to carry umbrellas to protect against sun than the rain.

Proxy labels are often imperfect. When possible, choose actual labels over proxy labels. That said, when an actual label is absent, pick the proxy label very carefully, choosing the least horrible proxy label candidate.

See Datasets: Labels in Machine Learning Crash Course for more information.

آر

RAG

#fundamentals

Abbreviation for retrieval-augmented generation .

ارزیاب

#fundamentals

A human who provides labels for examples . "Annotator" is another name for rater.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

واحد خطی اصلاح شده (ReLU)

#fundamentals

An activation function with the following behavior:

If input is negative or zero, then the output is 0.
If input is positive, then the output is equal to the input.

به عنوان مثال:

If the input is -3, then the output is 0.
If the input is +3, then the output is 3.0.

Here is a plot of ReLU:

A cartesian plot of two lines. The first line has a constant
y value of 0, running along the x-axis from -infinity,0 to 0,-0.
The second line starts at 0,0. This line has a slope of +1, so
it runs from 0,0 to +infinity,+infinity.

ReLU is a very popular activation function. Despite its simple behavior, ReLU still enables a neural network to learn nonlinear relationships between features and the label .

مدل رگرسیون

#fundamentals

Informally, a model that generates a numerical prediction. (In contrast, a classification model generates a class prediction.) For example, the following are all regression models:

A model that predicts a certain house's value in Euros, such as 423,000.
A model that predicts a certain tree's life expectancy in years, such as 23.2.
A model that predicts the amount of rain in inches that will fall in a certain city over the next six hours, such as 0.18.

Two common types of regression models are:

Linear regression , which finds the line that best fits label values to features.
Logistic regression , which generates a probability between 0.0 and 1.0 that a system typically then maps to a class prediction.

Not every model that outputs numerical predictions is a regression model. In some cases, a numeric prediction is really just a classification model that happens to have numeric class names. For example, a model that predicts a numeric postal code is a classification model, not a regression model.

منظم سازی

#fundamentals

Any mechanism that reduces overfitting . Popular types of regularization include:

L ₁ regularization
L ₂ regularization
تنظیم ترک تحصیل
early stopping (this is not a formal regularization method, but can effectively limit overfitting)

Regularization can also be defined as the penalty on a model's complexity.

Click the icon for additional notes.

Regularization is counterintuitive. Increasing regularization usually increases training loss, which is confusing because, well, isn't the goal to minimize training loss?

در واقع، نه. The goal isn't to minimize training loss. The goal is to make excellent predictions on real-world examples. Remarkably, even though increasing regularization increases training loss, it usually helps models make better predictions on real-world examples.

See Overfitting: Model complexity in Machine Learning Crash Course for more information.

regularization rate

#fundamentals

A number that specifies the relative importance of regularization during training. Raising the regularization rate reduces overfitting but may reduce the model's predictive power. Conversely, reducing or omitting the regularization rate increases overfitting.

Click the icon to see the math.

The regularization rate is usually represented as the Greek letter lambda. The following simplified loss equation shows lambda's influence:

$$\text{minimize(loss function + }\lambda\text{(regularization))}$$

where regularization is any regularization mechanism, including;

L ₁ regularization
L ₂ regularization

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

ReLU

#fundamentals

Abbreviation for Rectified Linear Unit .

retrieval-augmented generation (RAG)

#fundamentals

A technique for improving the quality of large language model (LLM) output by grounding it with sources of knowledge retrieved after the model was trained. RAG improves the accuracy of LLM responses by providing the trained LLM with access to information retrieved from trusted knowledge bases or documents.

Common motivations to use retrieval-augmented generation include:

Increasing the factual accuracy of a model's generated responses.
Giving the model access to knowledge it was not trained on.
Changing the knowledge that the model uses.
Enabling the model to cite sources.

For example, suppose that a chemistry app uses the PaLM API to generate summaries related to user queries. When the app's backend receives a query, the backend:

Searches for ("retrieves") data that's relevant to the user's query.
Appends ("augments") the relevant chemistry data to the user's query.
Instructs the LLM to create a summary based on the appended data.

ROC (receiver operating characteristic) Curve

#fundamentals

#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

The shape of an ROC curve suggests a binary classification model's ability to separate positive classes from negative classes. Suppose, for example, that a binary classification model perfectly separates all the negative classes from all the positive classes:

A number line with 8 positive examples on the right side and
7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

An ROC curve. The x-axis is False Positive Rate and the y-axis is True Positive Rate. The curve has an inverted L shape. منحنی starts at (0.0,0.0) and goes straight up to (0.0,1.0). Then the curve goes from (0.0,1.0) to (1.0,1.0).

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative classes
completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0)
to (1.0,1.0).

Meanwhile, back in the real world, most binary classification models separate positive and negative classes to some degree, but usually not perfectly. So, a typical ROC curve falls somewhere between the two extremes:

An ROC curve. The x-axis is False Positive Rate and the y-axis
is True Positive Rate. The ROC curve approximates a shaky arc
traversing the compass points from West to North.

The point on an ROC curve closest to (0.0,1.0) theoretically identifies the ideal classification threshold. However, several other real-world issues influence the selection of the ideal classification threshold. For example, perhaps false negatives cause far more pain than false positives.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.

Root Mean Squared Error (RMSE)

#fundamentals

#Metric

The square root of the Mean Squared Error .

اس

sigmoid function

#fundamentals

A mathematical function that "squishes" an input value into a constrained range, typically 0 to 1 or -1 to +1. That is, you can pass any number (two, a million, negative billion, whatever) to a sigmoid and the output will still be in the constrained range. A plot of the sigmoid activation function looks as follows:

A two-dimensional curved plot with x values spanning the domain
-infinity to +positive, while y values span the range almost 0 to
almost 1. When x is 0, y is 0.5. The slope of the curve is always
positive, with the highest slope at 0,0.5 and gradually decreasing
slopes as the absolute value of x increases.

The sigmoid function has several uses in machine learning, including:

Converting the raw output of a logistic regression or multinomial regression model to a probability.
Acting as an activation function in some neural networks.

Click the icon to see the math.

The sigmoid function over an input number x has the following formula:

$$ sigmoid(x) = \frac{1}{1 + e^{-\text{x}}} $$

In machine learning, x is generally a weighted sum .

سافت مکس

#fundamentals

A function that determines probabilities for each possible class in a multi-class classification model . The probabilities add up to exactly 1.0. For example, the following table shows how softmax distributes various probabilities:

Image is a...	احتمال
سگ	.85
گربه	.13
اسب	.02

Softmax is also called full softmax .

Contrast with candidate sampling .

Click the icon to see the math.

The softmax equation is as follows:

$$\sigma_i = \frac{e^{\text{z}_i}} {\sum_{j=1}^{j=K} {e^{\text{z}_j}}} $$

کجا:

$\sigma_i$ is the output vector. Each element of the output vector specifies the probability of this element. The sum of all the elements in the output vector is 1.0. The output vector contains the same number of elements as the input vector, $z$.
$z$ is the input vector. Each element of the input vector contains a floating-point value.
$K$ is the number of elements in the input vector (and the output vector).

For example, suppose the input vector is:

[1.2, 2.5, 1.8]

Therefore, softmax calculates the denominator as follows:

$$\text{denominator} = e^{1.2} + e^{2.5} + e^{1.8} = 21.552$$

The softmax probability of each element is therefore:

$$\sigma_1 = \frac{e^{1.2}}{21.552} = 0.154 $$$$\sigma_2 = \frac{e^{2.5}}{21.552} = 0.565 $$$$\sigma_1 = \frac{e^{1.8}}{21.552} = 0.281 $$

So, the output vector is therefore:

$$\sigma = [0.154, 0.565, 0.281]$$

The sum of the three elements in $\sigma$ is 1.0. اوه!

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

sparse feature

#language

#fundamentals

A feature whose values are predominately zero or empty. For example, a feature containing a single 1 value and a million 0 values is sparse. In contrast, a dense feature has values that are predominantly not zero or empty.

In machine learning, a surprising number of features are sparse features. Categorical features are usually sparse features. For example, of the 300 possible tree species in a forest, a single example might identify just a maple tree . Or, of the millions of possible videos in a video library, a single example might identify just "Casablanca."

In a model, you typically represent sparse features with one-hot encoding . If the one-hot encoding is big, you might put an embedding layer on top of the one-hot encoding for greater efficiency.

sparse representation

#language

#fundamentals

Storing only the position(s) of nonzero elements in a sparse feature.

For example, suppose a categorical feature named species identifies the 36 tree species in a particular forest. Further assume that each example identifies only a single species.

You could use a one-hot vector to represent the tree species in each example. A one-hot vector would contain a single 1 (to represent the particular tree species in that example) and 35 0 s (to represent the 35 tree species not in that example). So, the one-hot representation of maple might look something like the following:

A vector in which positions 0 through 23 hold the value 0, position
24 holds the value 1, and positions 25 through 35 hold the value 0.

Alternatively, sparse representation would simply identify the position of the particular species. If maple is at position 24, then the sparse representation of maple would simply be:

Notice that the sparse representation is much more compact than the one-hot representation.

Click the icon for a slightly more complex example.

Suppose each example in your model must represent the words—but not the order of those words—in an English sentence. English consists of about 170,000 words, so English is a categorical feature with about 170,000 elements. Most English sentences use an extremely tiny fraction of those 170,000 words, so the set of words in a single example is almost certainly going to be sparse data.

جمله زیر را در نظر بگیرید:

My dog is a great dog

You could use a variant of one-hot vector to represent the words in this sentence. In this variant, multiple cells in the vector can contain a nonzero value. Furthermore, in this variant, a cell can contain an integer other than one. Although the words "my", "is", "a", and "great" appear only once in the sentence, the word "dog" appears twice. Using this variant of one-hot vectors to represent the words in this sentence yields the following 170,000-element vector:

A sparse representation of the same sentence would simply be:

Click the icon if you are confused.

The term "sparse representation" confuses a lot of people because sparse representation is itself not a sparse vector . Rather, sparse representation is actually a dense representation of a sparse vector . The synonym index representation is a little clearer than "sparse representation."

See Working with categorical data in Machine Learning Crash Course for more information.

sparse vector

#fundamentals

A vector whose values are mostly zeroes. See also sparse feature and sparsity .

squared loss

#fundamentals

#Metric

Synonym for L ₂ loss .

ایستا

#fundamentals

Something done once rather than continuously. The terms static and offline are synonyms. The following are common uses of static and offline in machine learning:

static model (or offline model ) is a model trained once and then used for a while.
static training (or offline training ) is the process of training a static model.
static inference (or offline inference ) is a process in which a model generates a batch of predictions at a time.

Contrast with dynamic .

static inference

#fundamentals

Synonym for offline inference .

stationarity

#fundamentals

A feature whose values don't change across one or more dimensions, usually time. For example, a feature whose values look about the same in 2021 and 2023 exhibits stationarity.

In the real world, very few features exhibit stationarity. Even features synonymous with stability (like sea level) change over time.

Contrast with nonstationarity .

stochastic gradient descent (SGD)

#fundamentals

A gradient descent algorithm in which the batch size is one. In other words, SGD trains on a single example chosen uniformly at random from a training set .

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

یادگیری ماشینی تحت نظارت

#fundamentals

Training a model from features and their corresponding labels . Supervised machine learning is analogous to learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, a student can then provide answers to new (never-before-seen) questions on the same topic.

Compare with unsupervised machine learning .

See Supervised Learning in the Introduction to ML course for more information.

synthetic feature

#fundamentals

A feature not present among the input features, but assembled from one or more of them. Methods for creating synthetic features include the following:

Bucketing a continuous feature into range bins.
Creating a feature cross .
Multiplying (or dividing) one feature value by other feature value(s) or by itself. For example, if a and b are input features, then the following are examples of synthetic features:
- ab
- یک ²
Applying a transcendental function to a feature value. For example, if c is an input feature, then the following are examples of synthetic features:
- sin(c)
- ln(c)

Features created by normalizing or scaling alone are not considered synthetic features.

تی

test loss

#fundamentals

#Metric

A metric representing a model's loss against the test set . When building a model , you typically try to minimize test loss. That's because a low test loss is a stronger quality signal than a low training loss or low validation loss .

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate .

آموزش

#fundamentals

The process of determining the ideal parameters (weights and biases) comprising a model . During training, a system reads in examples and gradually adjusts parameters. Training uses each example anywhere from a few times to billions of times.

See Supervised Learning in the Introduction to ML course for more information.

از دست دادن آموزش

#fundamentals

#Metric

A metric representing a model's loss during a particular training iteration. For example, suppose the loss function is Mean Squared Error . Perhaps the training loss (the Mean Squared Error) for the 10th iteration is 2.2, and the training loss for the 100th iteration is 1.9.

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

A downward slope implies that the model is improving.
An upward slope implies that the model is getting worse.
A flat slope implies that the model has reached convergence .

For example, the following somewhat idealized loss curve shows:

A steep downward slope during the initial iterations, which implies rapid model improvement.
A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace then during the initial iterations.
A flat slope towards the end of training, which suggests convergence.

The plot of training loss versus iterations. This loss curve starts
with a steep downward slope. The slope gradually flattens until the
slope becomes zero.

Although training loss is important, see also generalization .

training-serving skew

#fundamentals

The difference between a model's performance during training and that same model's performance during serving .

مجموعه آموزشی

#fundamentals

The subset of the dataset used to train a model .

Traditionally, examples in the dataset are divided into the following three distinct subsets:

یک مجموعه آموزشی
a validation set
a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

منفی واقعی (TN)

#fundamentals

#Metric

An example in which the model correctly predicts the negative class . For example, the model infers that a particular email message is not spam , and that email message really is not spam .

مثبت واقعی (TP)

#fundamentals

#Metric

An example in which the model correctly predicts the positive class . For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals

#Metric

Synonym for recall . یعنی:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve .

U

underfitting

#fundamentals

Producing a model with poor predictive ability because the model hasn't fully captured the complexity of the training data. Many problems can cause underfitting, including:

Training on the wrong set of features .
Training for too few epochs or at too low a learning rate .
Training with too high a regularization rate .
Providing too few hidden layers in a deep neural network.

See Overfitting in Machine Learning Crash Course for more information.

unlabeled example

#fundamentals

An example that contains features but no label . For example, the following table shows three unlabeled examples from a house valuation model, each with three features but no house value:

تعداد اتاق خواب	Number of bathrooms	House age
3	2	15
2	1	72
4	2	34

In supervised machine learning , models train on labeled examples and make predictions on unlabeled examples .

In semi-supervised and unsupervised learning, unlabeled examples are used during training.

Contrast unlabeled example with labeled example .

یادگیری ماشینی بدون نظارت

#clustering

#fundamentals

Training a model to find patterns in a dataset, typically an unlabeled dataset.

The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can help when useful labels are scarce or absent. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.

Contrast with supervised machine learning .

Click the icon for additional notes.

Another example of unsupervised machine learning is principal component analysis (PCA) . For example, applying PCA on a dataset containing the contents of millions of shopping carts might reveal that shopping carts containing lemons frequently also contain antacids.

See What is Machine Learning? in the Introduction to ML course for more information.

V

اعتبار سنجی

#fundamentals

The initial evaluation of a model's quality. Validation checks the quality of a model's predictions against the validation set .

Because the validation set differs from the training set , validation helps guard against overfitting .

You might think of evaluating the model against the validation set as the first round of testing and evaluating the model against the test set as the second round of testing.

validation loss

#fundamentals

#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

See also generalization curve .

مجموعه اعتبار سنجی

#fundamentals

The subset of the dataset that performs initial evaluation against a trained model . Typically, you evaluate the trained model against the validation set several times before evaluating the model against the test set .

Traditionally, you divide the examples in the dataset into the following three distinct subsets:

یک مجموعه آموزشی
a validation set
a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

دبلیو

وزن

#fundamentals

A value that a model multiplies by another value. Training is the process of determining a model's ideal weights; inference is the process of using those learned weights to make predictions.

Click the icon to see an example of weights in a linear model.

Imagine a linear model with two features. Suppose that training determines the following weights (and bias ):

The bias, b, has a value of 2.2
The weight, w ₁ associated with one feature is 1.5.
The weight, w ₂ associated with the other feature is 0.4.

Now imagine an example with the following feature values:

The value of one feature, x ₁ , is 6.
The value of the other feature, x ₂ , is 10.

This linear model uses the following formula to generate a prediction, y':

$$y' = b + w_1x_1 + w_2x_2$$

Therefore, the prediction is:

$$y' = 2.2 + (1.5)(6) + (0.4)(10) = 15.2$$

If a weight is 0, then the corresponding feature doesn't contribute to the model. For example, if w ₁ is 0, then the value of x ₁ is irrelevant.

See Linear regression in Machine Learning Crash Course for more information.

weighted sum

#fundamentals

The sum of all the relevant input values multiplied by their corresponding weights. For example, suppose the relevant inputs consist of the following:

مقدار ورودی	input weight
2	-1.3
-1	0.6
3	0.4

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

A weighted sum is the input argument to an activation function .

ز

عادی سازی امتیاز Z

#fundamentals

A scaling technique that replaces a raw feature value with a floating-point value representing the number of standard deviations from that feature's mean. For example, consider a feature whose mean is 800 and whose standard deviation is 100. The following table shows how Z-score normalization would map the raw value to its Z-score:

ارزش خام	امتیاز Z
800	0
950	+1.5
575	-2.25

The machine learning model then trains on the Z-scores for that feature instead of on the raw values.

See Numerical data: Normalization in Machine Learning Crash Course for more information.

This page contains ML Fundamentals glossary terms. For all glossary terms, click here .

الف

دقت

#fundamentals

#Metric

The number of correct classification predictions divided by the total number of predictions. یعنی:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions }}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{\text{40}} {\text{40 + 10}} = \text{80%}$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions . So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

کجا:

TP is the number of true positives (correct predictions).
TN is the number of true negatives (correct predictions).
FP is the number of false positives (incorrect predictions).
FN is the number of false negatives (incorrect predictions).

Compare and contrast accuracy with precision and recall .

Click the icon for details about accuracy and class-imbalanced datasets.

Although a valuable metric for some situations, accuracy is highly misleading for others. Notably, accuracy is usually a poor metric for evaluating classification models that process class-imbalanced datasets .

For example, suppose snow falls only 25 days per century in a certain subtropical city. Since days without snow (the negative class) vastly outnumber days with snow (the positive class), the snow dataset for this city is class-imbalanced. Imagine a binary classification model that is supposed to predict either snow or no snow each day but simply predicts "no snow" every day. This model is highly accurate but has no predictive power. The following table summarizes the results for a century of predictions:

دسته بندی	شماره
TP	0
TN	36499
FP	0
FN	25

The accuracy of this model is therefore:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (0 + 36499) / (0 + 36499 + 0 + 25) = 0.9993 = 99.93%

Although 99.93% accuracy seems like very a impressive percentage, the model actually has no predictive power.

Precision and recall are usually more useful metrics than accuracy for evaluating models trained on class-imbalanced datasets.

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

عملکرد فعال سازی

#fundamentals

A function that enables neural networks to learn nonlinear (complex) relationships between features and the label.

Popular activation functions include:

ReLU
سیگموئید

The plots of activation functions are never single straight lines. For example, the plot of the ReLU activation function consists of two straight lines:

A plot of the sigmoid activation function looks as follows:

Click the icon to see an example.

In a neural network, activation functions manipulate the weighted sum of all the inputs to a neuron . To calculate a weighted sum, the neuron adds up the products of the relevant values and weights. For example, suppose the relevant input to a neuron consists of the following:

مقدار ورودی	input weight
2	-1.3
-1	0.6
3	0.4

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

Suppose the designer of this neural network chooses the sigmoid function to be the activation function. In that case, the neuron calculates the sigmoid of -2.0, which is approximately 0.12. Therefore, the neuron passes 0.12 (rather than -2.0) to the next layer in the neural network. The following figure illustrates the relevant part of the process:

See Neural networks: Activation functions in Machine Learning Crash Course for more information.

هوش مصنوعی

#fundamentals

A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.

Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

AUC (Area under the ROC curve)

#fundamentals

#Metric

A number between 0.0 and 1.0 representing a binary classification model's ability to separate positive classes from negative classes . The closer the AUC is to 1.0, the better the model's ability to separate classes from each other.

For example, the following illustration shows a classification model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

A number line with 8 positive examples on one side and
9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

A number line with 6 positive examples and 6 negative examples.
The sequence of examples is positive, negative,
positive, negative, positive, negative, positive, negative, positive
negative, positive, negative.

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

A number line with 6 positive examples and 6 negative examples. The sequence of examples is negative, negative, negative, negative, positive, negative, positive, positive, negative, positive, positive, مثبت

AUC ignores any value you set for classification threshold . Instead, AUC considers all possible classification thresholds.

Click the icon to learn about the relationship between AUC and ROC curves.

AUC represents the area under an ROC curve . For example, the ROC curve for a model that perfectly separates positives from negatives looks as follows:

AUC is the area of the gray region in the preceding illustration. In this unusual case, the area is simply the length of the gray region (1.0) multiplied by the width of the gray region (1.0). So, the product of 1.0 and 1.0 yields an AUC of exactly 1.0, which is the highest possible AUC score.

Conversely, the ROC curve for a classification model that can't separate classes at all is as follows. The area of this gray region is 0.5.

A more typical ROC curve looks approximately like the following:

It would be painstaking to calculate the area under this curve manually, which is why a program typically calculates most AUC values.

Click the icon for a more formal definition of AUC.

AUC is the probability that a classification model will be more confident than a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

ب

پس انتشار

#fundamentals

The algorithm that implements gradient descent in neural networks .

Training a neural network involves many iterations of the following two-pass cycle:

During the forward pass , the system processes a batch of examples to yield prediction(s). The system compares each prediction to each label value. The difference between the prediction and the label value is the loss for that example. The system aggregates the losses for all the examples to compute the total loss for the current batch.
During the backward pass (backpropagation), the system reduces loss by adjusting the weights of all the neurons in all the hidden layer(s) .

Neural networks often contain many neurons across many hidden layers. Each of those neurons contribute to the overall loss in different ways. Backpropagation determines whether to increase or decrease the weights applied to particular neurons.

The learning rate is a multiplier that controls the degree to which each backward pass increases or decreases each weight. A large learning rate will increase or decrease each weight more than a small learning rate.

In calculus terms, backpropagation implements the chain rule . from calculus. That is, backpropagation calculates the partial derivative of the error with respect to each parameter.

Years ago, ML practitioners had to write code to implement backpropagation. Modern ML APIs like Keras now implement backpropagation for you. اوه!

See Neural networks in Machine Learning Crash Course for more information.

دسته ای

#fundamentals

The set of examples used in one training iteration . The batch size determines the number of examples in a batch.

See epoch for an explanation of how a batch relates to an epoch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

اندازه دسته

#fundamentals

The number of examples in a batch . For instance, if the batch size is 100, then the model processes 100 examples per iteration .

The following are popular batch size strategies:

Stochastic Gradient Descent (SGD) , in which the batch size is 1.
Full batch, in which the batch size is the number of examples in the entire training set . For instance, if the training set contains a million examples, then the batch size would be a million examples. Full batch is usually an inefficient strategy.
mini-batch in which the batch size is usually between 10 and 1000. Mini-batch is usually the most efficient strategy.

برای اطلاعات بیشتر به ادامه مطلب مراجعه کنید:

Production ML systems: Static versus dynamic inference in Machine Learning Crash Course.
Deep Learning Tuning Playbook .

bias (ethics/fairness)

#responsible

#fundamentals

1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include:

2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:

Not to be confused with the bias term in machine learning models or prediction bias .

See Fairness: Types of bias in Machine Learning Crash Course for more information.

bias (math) or bias term

#fundamentals

An intercept or offset from an origin. Bias is a parameter in machine learning models, which is symbolized by either of the following:

ب
w ₀

For example, bias is the b in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

In a simple two-dimensional line, bias just means "y-intercept." For example, the bias of the line in the following illustration is 2.

The plot of a line with a slope of 0.5 and a bias (y-intercept) of 2.

Bias exists because not all models start from the origin (0,0). For example, suppose an amusement park costs 2 Euros to enter and an additional 0.5 Euro for every hour a customer stays. Therefore, a model mapping the total cost has a bias of 2 because the lowest cost is 2 Euros.

Bias is not to be confused with bias in ethics and fairness or prediction bias .

See Linear Regression in Machine Learning Crash Course for more information.

طبقه بندی باینری

#fundamentals

A type of classification task that predicts one of two mutually exclusive classes:

طبقه مثبت
the negative class

For example, the following two machine learning models each perform binary classification:

A model that determines whether email messages are spam (the positive class) or not spam (the negative class).
A model that evaluates medical symptoms to determine whether a person has a particular disease (the positive class) or doesn't have that disease (the negative class).

Contrast with multi-class classification .

See Classification in Machine Learning Crash Course for more information.

سطل سازی

#fundamentals

Converting a single feature into multiple binary features called buckets or bins , typically based on a value range. The chopped feature is typically a continuous feature .

For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete buckets, such as:

<= 10 degrees Celsius would be the "cold" bucket.
11 - 24 degrees Celsius would be the "temperate" bucket.
>= 25 degrees Celsius would be the "warm" bucket.

The model will treat every value in the same bucket identically. For example, the values 13 and 22 are both in the temperate bucket, so the model treats the two values identically.

Click the icon for additional notes.

If you represent temperature as a continuous feature, then the model treats temperature as a single feature. If you represent temperature as three buckets, then the model treats each bucket as a separate feature. That is, a model can learn separate relationships of each bucket to the label . For example, a linear regression model can learn separate weights for each bucket.

Increasing the number of buckets makes your model more complicated by increasing the number of relationships that your model must learn. For example, the cold, temperate, and warm buckets are essentially three separate features for your model to train on. If you decide to add two more buckets--for example, freezing and hot--your model would now have to train on five separate features.

How do you know how many buckets to create, or what the ranges for each bucket should be? The answers typically require a fair amount of experimentation.

See Numerical data: Binning in Machine Learning Crash Course for more information.

سی

داده های طبقه بندی شده

#fundamentals

Features having a specific set of possible values. For example, consider a categorical feature named traffic-light-state , which can only have one of the following three possible values:

red
yellow
green

By representing traffic-light-state as a categorical feature, a model can learn the differing impacts of red , green , and yellow on driver behavior.

Categorical features are sometimes called discrete features .

Contrast with numerical data .

See Working with categorical data in Machine Learning Crash Course for more information.

کلاس

#fundamentals

A category that a label can belong to. به عنوان مثال:

In a binary classification model that detects spam, the two classes might be spam and not spam .
In a multi-class classification model that identifies dog breeds, the classes might be poodle , beagle , pug , and so on.

A classification model predicts a class. In contrast, a regression model predicts a number rather than a class.

See Classification in Machine Learning Crash Course for more information.

مدل طبقه بندی

#fundamentals

A model whose prediction is a class . For example, the following are all classification models:

A model that predicts an input sentence's language (French? Spanish? Italian?).
A model that predicts tree species (Maple? Oak? Baobab?).
A model that predicts the positive or negative class for a particular medical condition.

In contrast, regression models predict numbers rather than classes.

Two common types of classification models are:

طبقه بندی باینری
multi-class classification

classification threshold

#fundamentals

In a binary classification , a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class . Note that the classification threshold is a value that a human chooses, not a value chosen by model training.

A logistic regression model outputs a raw value between 0 and 1. Then:

If this raw value is greater than the classification threshold, then the positive class is predicted.
If this raw value is less than the classification threshold, then the negative class is predicted.

For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class.

The choice of classification threshold strongly influences the number of false positives and false negatives .

Click the icon for additional notes.

As models or datasets evolve, engineers sometimes also change the classification threshold. When the classification threshold changes, positive class predictions can suddenly become negative classes and vice-versa.

For example, consider a binary classification disease prediction model. Suppose that when the system runs in the first year:

The raw value for a particular patient is 0.95.
The classification threshold is 0.94.

Therefore, the system diagnoses the positive class. (The patient gasps, "Oh no! I'm sick!")

A year later, perhaps the values now look as follows:

The raw value for the same patient remains at 0.95.
The classification threshold changes to 0.97.

Therefore, the system now reclassifies that patient as the negative class. ("Happy day! I'm not sick.") Same patient. Different diagnosis.

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

طبقه بندی کننده

#fundamentals

A casual term for a classification model .

class-imbalanced dataset

#fundamentals

A dataset for a classification problem in which the total number of labels of each class differs significantly. For example, consider a binary classification dataset whose two labels are divided as follows:

1,000,000 negative labels
10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is not class-imbalanced because the ratio of negative labels to positive labels is relatively close to 1:

517 negative labels
483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is also class-imbalanced because one label has far more examples than the other two:

1,000,000 labels with class "green"
200 labels with class "purple"
350 labels with class "orange"

See also entropy , majority class , and minority class .

بریدن

#fundamentals

A technique for handling outliers by doing either or both of the following:

Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
Increasing feature values that are less than a minimum threshold up to that minimum threshold.

For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:

Clip all values over 60 (the maximum threshold) to be exactly 60.
Clip all values under 40 (the minimum threshold) to be exactly 40.

Outliers can damage models, sometimes causing weights to overflow during training. Some outliers can also dramatically spoil metrics like accuracy . Clipping is a common technique to limit the damage.

Gradient clipping forces gradient values within a designated range during training.

See Numerical data: Normalization in Machine Learning Crash Course for more information.

ماتریس سردرگمی

#fundamentals

An NxN table that summarizes the number of correct and incorrect predictions that a classification model made. For example, consider the following confusion matrix for a binary classification model:

	Tumor (predicted)	Non-Tumor (predicted)
Tumor (ground truth)	18 (TP)	1 (FN)
Non-Tumor (ground truth)	6 (FP)	452 (TN)

The preceding confusion matrix shows the following:

Of the 19 predictions in which ground truth was Tumor, the model correctly classified 18 and incorrectly classified 1.
Of the 458 predictions in which ground truth was Non-Tumor, the model correctly classified 452 and incorrectly classified 6.

The confusion matrix for a multi-class classification problem can help you identify patterns of mistakes. For example, consider the following confusion matrix for a 3-class multi-class classification model that categorizes three different iris types (Virginica, Versicolor, and Setosa). When the ground truth was Virginica, the confusion matrix shows that the model was far more likely to mistakenly predict Versicolor than Setosa:

	Setosa (predicted)	Versicolor (predicted)	Virginica (predicted)
Setosa (ground truth)	88	12	0
Versicolor (ground truth)	6	141	7
Virginica (ground truth)	2	27	109

As yet another example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or mistakenly predict 1 instead of 7.

Confusion matrixes contain sufficient information to calculate a variety of performance metrics, including precision and recall .

continuous feature

#fundamentals

A floating-point feature with an infinite range of possible values, such as temperature or weight.

Contrast with discrete feature .

همگرایی

#fundamentals

A state reached when loss values change very little or not at all with each iteration . For example, the following loss curve suggests convergence at around 700 iterations:

Cartesian plot. X-axis is loss. Y-axis is the number of training تکرارها Loss is very high during first few iterations, but drops sharply. After about 100 iterations, loss is still descending but far more gradually. After about 700 iterations, loss stays flat.

A model converges when additional training won't improve the model.

In deep learning , loss values sometimes stay constant or nearly so for many iterations before finally descending. During a long period of constant loss values, you may temporarily get a false sense of convergence.

D

DataFrame

#fundamentals

A popular pandas data type for representing datasets in memory.

A DataFrame is analogous to a table or a spreadsheet. Each column of a DataFrame has a name (a header), and each row is identified by a unique number.

Each column in a DataFrame is structured like a 2D array, except that each column can be assigned its own data type.

See also the official pandas.DataFrame reference page .

data set or dataset

#fundamentals

A collection of raw data, commonly (but not exclusively) organized in one of the following formats:

a spreadsheet
a file in CSV (comma-separated values) format

deep model

#fundamentals

A neural network containing more than one hidden layer .

A deep model is also called a deep neural network .

Contrast with wide model .

dense feature

#fundamentals

A feature in which most or all values are nonzero, typically a Tensor of floating-point values. For example, the following 10-element Tensor is dense because 9 of its values are nonzero:

Contrast with sparse feature .

عمق

#fundamentals

The sum of the following in a neural network :

the number of hidden layers
the number of output layers , which is typically 1
the number of any embedding layers

For example, a neural network with five hidden layers and one output layer has a depth of 6.

Notice that the input layer doesn't influence depth.

discrete feature

#fundamentals

A feature with a finite set of possible values. For example, a feature whose values may only be animal , vegetable , or mineral is a discrete (or categorical) feature.

Contrast with continuous feature .

پویا

#fundamentals

Something done frequently or continuously. The terms dynamic and online are synonyms in machine learning. The following are common uses of dynamic and online in machine learning:

A dynamic model (or online model ) is a model that is retrained frequently or continuously.
Dynamic training (or online training ) is the process of training frequently or continuously.
Dynamic inference (or online inference ) is the process of generating predictions on demand.

dynamic model

#fundamentals

A model that is frequently (maybe even continuously) retrained. A dynamic model is a "lifelong learner" that constantly adapts to evolving data. A dynamic model is also known as an online model .

Contrast with static model .

E

توقف زودهنگام

#fundamentals

A method for regularization that involves ending training before training loss finishes decreasing. In early stopping, you intentionally stop training the model when the loss on a validation dataset starts to increase; that is, when generalization performance worsens.

Click the icon for additional notes.

Early stopping may seem counterintuitive. After all, telling a model to halt training while the loss is still decreasing may seem like telling a chef to stop cooking before the dessert has fully baked. However, training a model for too long can lead to overfitting . That is, if you train a model too long, the model may fit the training data so closely that the model doesn't make good predictions on new examples.

لایه جاسازی

#language

#fundamentals

A special hidden layer that trains on a high-dimensional categorical feature to gradually learn a lower dimension embedding vector. An embedding layer enables a neural network to train far more efficiently than training just on the high-dimensional categorical feature.

For example, Earth currently supports about 73,000 tree species. Suppose tree species is a feature in your model, so your model's input layer includes a one-hot vector 73,000 elements long. For example, perhaps baobab would be represented something like this:

An array of 73,000 elements. The first 6,232 elements hold the value 0. The next element holds the value 1. The final 66,767 elements hold مقدار صفر

A 73,000-element array is very long. If you don't add an embedding layer to the model, training is going to be very time consuming due to multiplying 72,999 zeros. Perhaps you pick the embedding layer to consist of 12 dimensions. Consequently, the embedding layer will gradually learn a new embedding vector for each tree species.

In certain situations, hashing is a reasonable alternative to an embedding layer.

See Embeddings in Machine Learning Crash Course for more information.

دوران

#fundamentals

A full training pass over the entire training set such that each example has been processed once.

An epoch represents N / batch size training iterations , where N is the total number of examples.

For instance, suppose the following:

The dataset consists of 1,000 examples.
The batch size is 50 examples.

Therefore, a single epoch requires 20 iterations:

1 epoch = (N/batch size) = (1,000 / 50) = 20 iterations

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

مثال

#fundamentals

The values of one row of features and possibly a label . Examples in supervised learning fall into two general categories:

A labeled example consists of one or more features and a label. Labeled examples are used during training.
An unlabeled example consists of one or more features but no label. Unlabeled examples are used during inference.

For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. Here are three labeled examples:

ویژگی ها			برچسب بزنید
دما	رطوبت	فشار	نمره آزمون
15	47	998	خوب
19	34	1020	عالی
18	92	1012	بیچاره

Here are three unlabeled examples:

دما	رطوبت	فشار
12	62	1014
21	47	1017
19	41	1021

The row of a dataset is typically the raw source for an example. That is, an example typically consists of a subset of the columns in the dataset. Furthermore, the features in an example can also include synthetic features , such as feature crosses .

See Supervised Learning in the Introduction to Machine Learning course for more information.

اف

منفی کاذب (FN)

#fundamentals

#Metric

An example in which the model mistakenly predicts the negative class . For example, the model predicts that a particular email message is not spam (the negative class), but that email message actually is spam .

مثبت کاذب (FP)

#fundamentals

#Metric

An example in which the model mistakenly predicts the positive class . For example, the model predicts that a particular email message is spam (the positive class), but that email message is actually not spam .

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals

#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve .

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

ویژگی

#fundamentals

An input variable to a machine learning model. An example consists of one or more features. For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. The following table shows three examples, each of which contains three features and one label:

ویژگی ها			برچسب بزنید
دما	رطوبت	فشار	نمره آزمون
15	47	998	92
19	34	1020	84
18	92	1012	87

Contrast with label .

See Supervised Learning in the Introduction to Machine Learning course for more information.

feature cross

#fundamentals

A synthetic feature formed by "crossing" categorical or bucketed features.

For example, consider a "mood forecasting" model that represents temperature in one of the following four buckets:

freezing
chilly
temperate
warm

And represents wind speed in one of the following three buckets:

still
light
windy

Without feature crosses, the linear model trains independently on each of the preceding seven various buckets. So, the model trains on, for example, freezing independently of the training on, for example, windy .

Alternatively, you could create a feature cross of temperature and wind speed. This synthetic feature would have the following 12 possible values:

freezing-still
freezing-light
freezing-windy
chilly-still
chilly-light
chilly-windy
temperate-still
temperate-light
temperate-windy
warm-still
warm-light
warm-windy

Thanks to feature crosses, the model can learn mood differences between a freezing-windy day and a freezing-still day.

If you create a synthetic feature from two features that each have a lot of different buckets, the resulting feature cross will have a huge number of possible combinations. For example, if one feature has 1,000 buckets and the other feature has 2,000 buckets, the resulting feature cross has 2,000,000 buckets.

Formally, a cross is a Cartesian product .

Feature crosses are mostly used with linear models and are rarely used with neural networks.

See Categorical data: Feature crosses in Machine Learning Crash Course for more information.

مهندسی ویژگی

#fundamentals

#TensorFlow

A process that involves the following steps:

Determining which features might be useful in training a model.
Converting raw data from the dataset into efficient versions of those features.

For example, you might determine that temperature might be a useful feature. Then, you might experiment with bucketing to optimize what the model can learn from different temperature ranges.

Feature engineering is sometimes called feature extraction or featurization .

Click the icon for additional notes about TensorFlow.

In TensorFlow, feature engineering often means converting raw log file entries to tf.Example protocol buffers. See also tf.Transform .

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

مجموعه ویژگی

#fundamentals

The group of features your machine learning model trains on. For example, a simple feature set for a model that predicts housing prices might consist of postal code, property size, and property condition.

بردار ویژگی

#fundamentals

The array of feature values comprising an example . The feature vector is input during training and during inference . For example, the feature vector for a model with two discrete features might be:

[0.92, 0.56]

Four layers: an input layer, two hidden layers, and one output layer.
The input layer contains two nodes, one containing the value
0.92 and the other containing the value 0.56.

Each example supplies different values for the feature vector, so the feature vector for the next example could be something like:

[0.73, 0.49]

Feature engineering determines how to represent features in the feature vector. For example, a binary categorical feature with five possible values might be represented with one-hot encoding . In this case, the portion of the feature vector for a particular example would consist of four zeroes and a single 1.0 in the third position, as follows:

[0.0, 0.0, 1.0, 0.0, 0.0]

As another example, suppose your model consists of three features:

a binary categorical feature with five possible values represented with one-hot encoding; for example: [0.0, 1.0, 0.0, 0.0, 0.0]
another binary categorical feature with three possible values represented with one-hot encoding; for example: [0.0, 0.0, 1.0]
a floating-point feature; for example: 8.3 .

In this case, the feature vector for each example would be represented by nine values. Given the example values in the preceding list, the feature vector would be:

0.0
1.0
0.0
0.0
0.0
0.0
0.0
1.0
8.3

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

حلقه بازخورد

#fundamentals

In machine learning, a situation in which a model's predictions influence the training data for the same model or another model. For example, a model that recommends movies will influence the movies that people see, which will then influence subsequent movie recommendation models.

See Production ML systems: Questions to ask in Machine Learning Crash Course for more information.

جی

تعمیم

#fundamentals

A model's ability to make correct predictions on new, previously unseen data. A model that can generalize is the opposite of a model that is overfitting .

Click the icon for additional notes.

You train a model on the examples in the training set. Consequently, the model learns the peculiarities of the data in the training set. Generalization essentially asks whether your model can make good predictions on examples that are not in the training set.

To encourage generalization, regularization helps a model train less exactly to the peculiarities of the data in the training set.

See Generalization in Machine Learning Crash Course for more information.

generalization curve

#fundamentals

A plot of both training loss and validation loss as a function of the number of iterations .

A generalization curve can help you detect possible overfitting . For example, the following generalization curve suggests overfitting because validation loss ultimately becomes significantly higher than training loss.

A Cartesian graph in which the y-axis is labeled loss and the x-axis
is labeled iterations. Two plots appear. One plots shows the
training loss and the other shows the validation loss.
The two plots start off similarly, but the training loss eventually
dips far lower than the validation loss.

See Generalization in Machine Learning Crash Course for more information.

شیب نزول

#fundamentals

A mathematical technique to minimize loss . Gradient descent iteratively adjusts weights and biases , gradually finding the best combination to minimize loss.

Gradient descent is older—much, much older—than machine learning.

See the Linear regression: Gradient descent in Machine Learning Crash Course for more information.

حقیقت زمین

#fundamentals

واقعیت.

The thing that actually happened.

For example, consider a binary classification model that predicts whether a student in their first year of university will graduate within six years. Ground truth for this model is whether or not that student actually graduated within six years.

Click the icon for additional notes.

We assess model quality against ground truth. However, ground truth is not always completely, well, truthful. For example, consider the following examples of potential imperfections in ground truth:

In the graduation example, are we certain that the graduation records for each student are always correct? Is the university's record-keeping flawless?
Suppose the label is a floating-point value measured by instruments (for example, barometers). How can we be sure that each instrument is calibrated identically or that each reading was taken under the same circumstances?
If the label is a matter of human opinion, how can we be sure that each human rater is evaluating events in the same way? To improve consistency, expert human raters sometimes intervene.

اچ

لایه پنهان

#fundamentals

A layer in a neural network between the input layer (the features) and the output layer (the prediction). Each hidden layer consists of one or more neurons . For example, the following neural network contains two hidden layers, the first with three neurons and the second with two neurons:

A deep neural network contains more than one hidden layer. For example, the preceding illustration is a deep neural network because the model contains two hidden layers.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course for more information.

هایپرپارامتر

#fundamentals

The variables that you or a hyperparameter tuning serviceadjust during successive runs of training a model. For example, learning rate is a hyperparameter. You could set the learning rate to 0.01 before one training session. If you determine that 0.01 is too high, you could perhaps set the learning rate to 0.003 for the next training session.

In contrast, parameters are the various weights and bias that the model learns during training.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

من

independently and identically distributed (iid)

#fundamentals

Data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on values that have been drawn previously. An iid is the ideal gas of machine learning—a useful mathematical construct but almost never exactly found in the real world. For example, the distribution of visitors to a web page may be iid over a brief window of time; that is, the distribution doesn't change during that brief window and one person's visit is generally independent of another's visit. However, if you expand that window of time, seasonal differences in the web page's visitors may appear.

استنتاج

#fundamentals

In machine learning, the process of making predictions by applying a trained model to unlabeled examples .

Inference has a somewhat different meaning in statistics. See the Wikipedia article on statistical inference for details.

See Supervised Learning in the Intro to ML course to see inference's role in a supervised learning system.

لایه ورودی

#fundamentals

The layer of a neural network that holds the feature vector . That is, the input layer provides examples for training or inference . For example, the input layer in the following neural network consists of two features:

Four layers: an input layer, two hidden layers, and an output layer.

تفسیر پذیری

#fundamentals

The ability to explain or to present an ML model's reasoning in understandable terms to a human.

Most linear regression models, for example, are highly interpretable. (You merely need to look at the trained weights for each feature.) Decision forests are also highly interpretable. Some models, however, require sophisticated visualization to become interpretable.

You can use the Learning Interpretability Tool (LIT) to interpret ML models.

تکرار

#fundamentals

A single update of a model's parameters—the model's weights and biases —during training . The batch size determines how many examples the model processes in a single iteration. For instance, if the batch size is 20, then the model processes 20 examples before adjusting the parameters.

When training a neural network , a single iteration involves the following two passes:

A forward pass to evaluate loss on a single batch.
A backward pass ( backpropagation ) to adjust the model's parameters based on the loss and the learning rate.

See Gradient descent in Machine Learning Crash Course for more information.

L

L ₀ regularization

#fundamentals

A type of regularization that penalizes the total number of nonzero weights in a model. For example, a model having 11 nonzero weights would be penalized more than a similar model having 10 nonzero weights.

L ₀ regularization is sometimes called L0-norm regularization .

Click the icon for additional notes.

L ₀ regularization is generally impractical in large models because L ₀ regularization turns training into a convex optimization problem.

L ₁ loss

#fundamentals

#Metric

A loss function that calculates the absolute value of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L ₁ loss for a batch of five examples :

Actual value of example	Model's predicted value	Absolute value of delta
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L ₁ loss

L ₁ loss is less sensitive to outliers than L ₂ loss .

The Mean Absolute Error is the average L ₁ loss per example.

Click the icon to see the formal math.

$$ L_1 loss = \sum_{i=0}^n | y_i - \hat{y}_i |$$

کجا:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

See Linear regression: Loss in Machine Learning Crash Course for more information.

L ₁ regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the absolute value of the weights. L ₁ regularization helps drive the weights of irrelevant or barely relevant features to exactly 0 . A feature with a weight of 0 is effectively removed from the model.

Contrast with L ₂ regularization .

L ₂ loss

#fundamentals

#Metric

A loss function that calculates the square of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L ₂ loss for a batch of five examples :

Actual value of example	Model's predicted value	Square of delta
7	6	1
5	4	1
8	11	9
4	6	4
9	8	1
		16 = L ₂ loss

Due to squaring, L ₂ loss amplifies the influence of outliers . That is, L ₂ loss reacts more strongly to bad predictions than L ₁ loss . For example, the L ₁ loss for the preceding batch would be 8 rather than 16. Notice that a single outlier accounts for 9 of the 16.

Regression models typically use L ₂ loss as the loss function.

The Mean Squared Error is the average L ₂ loss per example. Squared loss is another name for L ₂ loss.

Click the icon to see the formal math.

$$ L_2 loss = \sum_{i=0}^n {(y_i - \hat{y}_i)}^2$$

کجا:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

L ₂ regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L ₂ regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0 . Features with values very close to 0 remain in the model but don't influence the model's prediction very much.

L ₂ regularization always improves generalization in linear models .

Contrast with L ₁ regularization .

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

برچسب

#fundamentals

In supervised machine learning , the "answer" or "result" portion of an example .

Each labeled example consists of one or more features and a label. For example, in a spam detection dataset, the label would probably be either "spam" or "not spam." In a rainfall dataset, the label might be the amount of rain that fell during a certain period.

See Supervised Learning in Introduction to Machine Learning for more information.

labeled example

#fundamentals

An example that contains one or more features and a label . For example, the following table shows three labeled examples from a house valuation model, each with three features and one label:

تعداد اتاق خواب	Number of bathrooms	House age	House price (label)
3	2	15	345000 دلار
2	1	72	179000 دلار
4	2	34	392000 دلار

In supervised machine learning , models train on labeled examples and make predictions on unlabeled examples .

Contrast labeled example with unlabeled examples.

See Supervised Learning in Introduction to Machine Learning for more information.

لامبدا

#fundamentals

Synonym for regularization rate .

Lambda is an overloaded term. Here we're focusing on the term's definition within regularization .

لایه

#fundamentals

A set of neurons in a neural network . Three common types of layers are as follows:

The input layer , which provides values for all the features .
One or more hidden layers , which find nonlinear relationships between the features and the label.
The output layer , which provides the prediction.

For example, the following illustration shows a neural network with one input layer, two hidden layers, and one output layer:

In TensorFlow , layers are also Python functions that take Tensors and configuration options as input and produce other tensors as output.

میزان یادگیری

#fundamentals

A floating-point number that tells the gradient descent algorithm how strongly to adjust weights and biases on each iteration . For example, a learning rate of 0.3 would adjust weights and biases three times more powerfully than a learning rate of 0.1.

Learning rate is a key hyperparameter . If you set the learning rate too low, training will take too long. If you set the learning rate too high, gradient descent often has trouble reaching convergence .

Click the icon for a more mathematical explanation.

During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step .

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

خطی

#fundamentals

A relationship between two or more variables that can be represented solely through addition and multiplication.

The plot of a linear relationship is a line.

Contrast with nonlinear .

مدل خطی

#fundamentals

A model that assigns one weight per feature to make predictions . (Linear models also incorporate a bias .) In contrast, the relationship of features to predictions in deep models is generally nonlinear .

Linear models are usually easier to train and more interpretable than deep models. However, deep models can learn complex relationships between features.

Linear regression and logistic regression are two types of linear models.

Click the icon to see the math.

A linear model follows this formula:

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

کجا:

y' is the raw prediction. (In certain kinds of linear models, this raw prediction will be further modified. For example, see logistic regression .)
b is the bias .
w is a weight , so w ₁ is the weight of the first feature, w ₂ is the weight of the second feature, and so on.
x is a feature , so x ₁ is the value of the first feature, x ₂ is the value of the second feature, and so on.

For example, suppose a linear model for three features learns the following bias and weights:

b = 7
w ₁ = -2.5
w ₂ = -1.2
w ₃ = 1.4

Therefore, given three features (x ₁ , x ₂ , and x ₃ ), the linear model uses the following equation to generate each prediction:

y' = 7 + (-2.5)(x₁) + (-1.2)(x₂) + (1.4)(x₃)

Suppose a particular example contains the following values:

x ₁ = 4
x ₂ = -10
x ₃ = 5

Plugging those values into the formula yields a prediction for this example:

y' = 7 + (-2.5)(4) + (-1.2)(-10) + (1.4)(5)
y' = 16

Linear models include not only models that use only a linear equation to make predictions but also a broader set of models that use a linear equation as just one component of the formula that makes predictions. For example, logistic regression post-processes the raw prediction (y') to produce a final prediction value between 0 and 1, exclusively.

رگرسیون خطی

#fundamentals

A type of machine learning model in which both of the following are true:

The model is a linear model .
The prediction is a floating-point value. (This is the regression part of linear regression .)

Contrast linear regression with logistic regression . Also, contrast regression with classification .

See Linear regression in Machine Learning Crash Course for more information.

رگرسیون لجستیک

#fundamentals

A type of regression model that predicts a probability. Logistic regression models have the following characteristics:

The label is categorical . The term logistic regression usually refers to binary logistic regression , that is, to a model that calculates probabilities for labels with two possible values. A less common variant, multinomial logistic regression , calculates probabilities for labels with more than two possible values.
The loss function during training is Log Loss . (Multiple Log Loss units can be placed in parallel for labels with more than two possible values.)
The model has a linear architecture, not a deep neural network. However, the remainder of this definition also applies to deep models that predict probabilities for categorical labels.

For example, consider a logistic regression model that calculates the probability of an input email being either spam or not spam. During inference, suppose the model predicts 0.72. Therefore, the model is estimating:

A 72% chance of the email being spam.
A 28% chance of the email not being spam.

A logistic regression model uses the following two-step architecture:

The model generates a raw prediction (y') by applying a linear function of input features.
The model uses that raw prediction as input to a sigmoid function , which converts the raw prediction to a value between 0 and 1, exclusive.

Like any regression model, a logistic regression model predicts a number. However, this number typically becomes part of a binary classification model as follows:

If the predicted number is greater than the classification threshold , the binary classification model predicts the positive class.
If the predicted number is less than the classification threshold, the binary classification model predicts the negative class.

See Logistic regression in Machine Learning Crash Course for more information.

از دست دادن گزارش

#fundamentals

The loss function used in binary logistic regression .

Click the icon to see the math.

The following formula calculates Log Loss:

$$\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')$$

کجا:

$(x,y)\in D$ is the dataset containing many labeled examples, which are $(x,y)$ جفت
$y$ is the label in a labeled example. Since this is logistic regression, every value of $y$ must either be 0 or 1.
$y'$ is the predicted value (somewhere between 0 and 1, exclusive), given the set of features in $x$.

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

log-odds

#fundamentals

The logarithm of the odds of some event.

Click the icon to see the math.

If the event is a binary probability, then odds refers to the ratio of the probability of success ( p ) to the probability of failure (1- p ). For example, suppose that a given event has a 90% probability of success and a 10% probability of failure. In this case, odds is calculated as follows:

$$ {\text{odds}} = \frac{\text{p}} {\text{(1-p)}} = \frac{.9} {.1} = {\text{9}} $$

The log-odds is simply the logarithm of the odds. By convention, "logarithm" refers to natural logarithm , but logarithm could actually be any base greater than 1. Sticking to convention, the log-odds of our example is therefore:

$$ {\text{log-odds}} = ln(9) ~= 2.2 $$

The log-odds function is the inverse of the sigmoid function .

از دست دادن

#fundamentals

#Metric

During the training of a supervised model , a measure of how far a model's prediction is from its label .

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss curve

#fundamentals

A plot of loss as a function of the number of training iterations . The following plot shows a typical loss curve:

A Cartesian graph of loss versus training iterations, showing a
rapid drop in loss for the initial iterations, followed by a gradual
drop, and then a flat slope during the final iterations.

Loss curves can help you determine when your model is converging or overfitting .

Loss curves can plot all of the following types of loss:

از دست دادن آموزش
validation loss
test loss

See also generalization curve .

See Overfitting: Interpreting loss curves in Machine Learning Crash Course for more information.

عملکرد از دست دادن

#fundamentals

#Metric

During training or testing, a mathematical function that calculates the loss on a batch of examples. A loss function returns a lower loss for models that makes good predictions than for models that make bad predictions.

The goal of training is typically to minimize the loss that a loss function returns.

Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. به عنوان مثال:

L ₂ loss (or Mean Squared Error ) is the loss function for linear regression .
Log Loss is the loss function for logistic regression .

م

یادگیری ماشینی

#fundamentals

A program or system that trains a model from input data. The trained model can make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model.

Machine learning also refers to the field of study concerned with these programs or systems.

See the Introduction to Machine Learning course for more information.

majority class

#fundamentals

The more common label in a class-imbalanced dataset . For example, given a dataset containing 99% negative labels and 1% positive labels, the negative labels are the majority class.

Contrast with minority class .

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

mini-batch

#fundamentals

A small, randomly selected subset of a batch processed in one iteration . The batch size of a mini-batch is usually between 10 and 1,000 examples.

For example, suppose the entire training set (the full batch) consists of 1,000 examples. Further suppose that you set the batch size of each mini-batch to 20. Therefore, each iteration determines the loss on a random 20 of the 1,000 examples and then adjusts the weights and biases accordingly.

It is much more efficient to calculate the loss on a mini-batch than the loss on all the examples in the full batch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

minority class

#fundamentals

The less common label in a class-imbalanced dataset . For example, given a dataset containing 99% negative labels and 1% positive labels, the positive labels are the minority class.

Contrast with majority class .

Click the icon for additional notes.

A training set with a million examples sounds impressive. However, if the minority class is poorly represented, then even a very large training set might be insufficient. Focus less on the total number of examples in the dataset and more on the number of examples in the minority class.

If your dataset doesn't contain enough minority class examples, consider using downsampling (the definition in the second bullet) to supplement the minority class.

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

مدل

#fundamentals

In general, any mathematical construct that processes input data and returns output. Phrased differently, a model is the set of parameters and structure needed for a system to make predictions. In supervised machine learning , a model takes an example as input and infers a prediction as output. Within supervised machine learning, models differ somewhat. به عنوان مثال:

A linear regression model consists of a set of weights and a bias .
A neural network model consists of:
- A set of hidden layers , each containing one or more neurons .
- The weights and bias associated with each neuron.
A decision tree model consists of:
- The shape of the tree; that is, the pattern in which the conditions and leaves are connected.
- The conditions and leaves.

You can save, restore, or make copies of a model.

Unsupervised machine learning also generates models, typically a function that can map an input example to the most appropriate cluster .

Click the icon to compare algebraic and programming functions to ML models.

An algebraic function such as the following is a model:

  f(x, y) = 3x -5xy + y² + 17

The preceding function maps input values ( x and y ) to output.

Similarly, a programming function like the following is also a model:

def half_of_greater(x, y):
  if (x > y):
    return(x / 2)
  else
    return(y / 2)

A caller passes arguments to the preceding Python function, and the Python function generates output (via the return statement).

Although a deep neural network has a very different mathematical structure than an algebraic or programming function, a deep neural network still takes input (an example) and returns output (a prediction).

A human programmer codes a programming function manually. In contrast, a machine learning model gradually learns the optimal parameters during automated training.

multi-class classification

#fundamentals

In supervised learning, a classification problem in which the dataset contains more than two classes of labels. For example, the labels in the Iris dataset must be one of the following three classes:

زنبق ستوزا
زنبق ویرجینیکا
زنبق ورسیکالر

A model trained on the Iris dataset that predicts Iris type on new examples is performing multi-class classification.

In contrast, classification problems that distinguish between exactly two classes are binary classification models . For example, an email model that predicts either spam or not spam is a binary classification model.

In clustering problems, multi-class classification refers to more than two clusters.

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

ن

negative class

#fundamentals

#Metric

In binary classification , one class is termed positive and the other is termed negative . The positive class is the thing or event that the model is testing for and the negative class is the other possibility. به عنوان مثال:

The negative class in a medical test might be "not tumor."
The negative class in an email classification model might be "not spam."

Contrast with positive class .

شبکه عصبی

#fundamentals

A model containing at least one hidden layer . A deep neural network is a type of neural network containing more than one hidden layer. For example, the following diagram shows a deep neural network containing two hidden layers.

A neural network with an input layer, two hidden layers, and an لایه خروجی

Each neuron in a neural network connects to all of the nodes in the next layer. For example, in the preceding diagram, notice that each of the three neurons in the first hidden layer separately connect to both of the two neurons in the second hidden layer.

Neural networks implemented on computers are sometimes called artificial neural networks to differentiate them from neural networks found in brains and other nervous systems.

Some neural networks can mimic extremely complex nonlinear relationships between different features and the label.

See Neural networks in Machine Learning Crash Course for more information.

نورون

#fundamentals

In machine learning, a distinct unit within a hidden layer of a neural network . Each neuron performs the following two-step action:

Calculates the weighted sum of input values multiplied by their corresponding weights.
Passes the weighted sum as input to an activation function .

A neuron in the first hidden layer accepts inputs from the feature values in the input layer . A neuron in any hidden layer beyond the first accepts inputs from the neurons in the preceding hidden layer. For example, a neuron in the second hidden layer accepts inputs from the neurons in the first hidden layer.

The following illustration highlights two neurons and their inputs.

A neural network with an input layer, two hidden layers, and an لایه خروجی Two neurons are highlighted: one in the first hidden layer and one in the second hidden layer. The highlighted neuron in the first hidden layer receives inputs from both features in the input layer. The highlighted neuron in the second hidden layer receives inputs from each of the three neurons in the first hidden لایه.

A neuron in a neural network mimics the behavior of neurons in brains and other parts of nervous systems.

node (neural network)

#fundamentals

A neuron in a hidden layer .

See Neural Networks in Machine Learning Crash Course for more information.

غیر خطی

#fundamentals

A relationship between two or more variables that can't be represented solely through addition and multiplication. A linear relationship can be represented as a line; a nonlinear relationship can't be represented as a line. For example, consider two models that each relate a single feature to a single label. The model on the left is linear and the model on the right is nonlinear:

دو قطعه One plot is a line, so this is a linear relationship. The other plot is a curve, so this is a nonlinear relationship.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course to experiment with different kinds of nonlinear functions.

nonstationarity

#fundamentals

A feature whose values change across one or more dimensions, usually time. For example, consider the following examples of nonstationarity:

The number of swimsuits sold at a particular store varies with the season.
The quantity of a particular fruit harvested in a particular region is zero for much of the year but large for a brief period.
Due to climate change, annual mean temperatures are shifting.

Contrast with stationarity .

عادی سازی

#fundamentals

Broadly speaking, the process of converting a variable's actual range of values into a standard range of values, such as:

-1 to +1
0 به 1
Z-scores (roughly, -3 to +3)

See also Z-score normalization .

See Numerical Data: Normalization in Machine Learning Crash Course for more information.

داده های عددی

#fundamentals

Numerical features are sometimes called continuous features .

See Working with numerical data in Machine Learning Crash Course for more information.

O

آفلاین

#fundamentals

Synonym for static .

offline inference

#fundamentals

The process of a model generating a batch of predictions and then caching (saving) those predictions. Apps can then access the inferred prediction from the cache rather than rerunning the model.

Offline inference is also called static inference .

Contrast with online inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

one-hot encoding

#fundamentals

Representing categorical data as a vector in which:

One element is set to 1.
All other elements are set to 0.

"Denmark"
"سوئد"
"Norway"
"Finland"
"Iceland"

One-hot encoding could represent each of the five values as follows:

کشور	بردار
"Denmark"	1	0	0	0	0
"سوئد"	0	1	0	0	0
"Norway"	0	0	1	0	0
"Finland"	0	0	0	1	0
"Iceland"	0	0	0	0	1

Thanks to one-hot encoding, a model can learn different connections based on each of the five countries.

"Denmark" is 0
"Sweden" is 1
"Norway" is 2
"Finland" is 3
"Iceland" is 4

See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.

one-vs.-all

#fundamentals

animal versus not animal
vegetable versus not vegetable
mineral versus not mineral

آنلاین

#fundamentals

Synonym for dynamic .

online inference

#fundamentals

Contrast with offline inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

output layer

#fundamentals

The "final" layer of a neural network. The output layer contains the prediction.

The following illustration shows a small deep neural network with an input layer, two hidden layers, and an output layer:

بیش از حد

#fundamentals

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

Regularization can reduce overfitting. Training on a large and diverse training set can also reduce overfitting.

Click the icon for additional notes.

See Overfitting in Machine Learning Crash Course for more information.

پ

پانداها

#fundamentals

پارامتر

#fundamentals

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

In contrast, hyperparameters are the values that you (or a hyperparameter tuning service) supply to the model. For example, learning rate is a hyperparameter.

positive class

#fundamentals

#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class .

Click the icon for additional notes.

Admittedly, you're simultaneously testing for both the positive and negative classes.

پس پردازش

#responsible

#fundamentals

Adjusting the output of a model after the model has been run. Post-processing can be used to enforce fairness constraints without modifying models themselves.

پیش بینی

#fundamentals

A model's output. به عنوان مثال:

The prediction of a binary classification model is either the positive class or the negative class.
The prediction of a multi-class classification model is one class.
The prediction of a linear regression model is a number.

proxy labels

#fundamentals

Data used to approximate labels not directly available in a dataset.

See Datasets: Labels in Machine Learning Crash Course for more information.

آر

RAG

#fundamentals

Abbreviation for retrieval-augmented generation .

ارزیاب

#fundamentals

A human who provides labels for examples . "Annotator" is another name for rater.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

واحد خطی اصلاح شده (ReLU)

#fundamentals

An activation function with the following behavior:

If input is negative or zero, then the output is 0.
If input is positive, then the output is equal to the input.

به عنوان مثال:

If the input is -3, then the output is 0.
If the input is +3, then the output is 3.0.

Here is a plot of ReLU:

ReLU is a very popular activation function. Despite its simple behavior, ReLU still enables a neural network to learn nonlinear relationships between features and the label .

مدل رگرسیون

#fundamentals

Informally, a model that generates a numerical prediction. (In contrast, a classification model generates a class prediction.) For example, the following are all regression models:

A model that predicts a certain house's value in Euros, such as 423,000.
A model that predicts a certain tree's life expectancy in years, such as 23.2.
A model that predicts the amount of rain in inches that will fall in a certain city over the next six hours, such as 0.18.

Two common types of regression models are:

Linear regression , which finds the line that best fits label values to features.
Logistic regression , which generates a probability between 0.0 and 1.0 that a system typically then maps to a class prediction.

منظم سازی

#fundamentals

Any mechanism that reduces overfitting . Popular types of regularization include:

L ₁ regularization
L ₂ regularization
تنظیم ترک تحصیل
early stopping (this is not a formal regularization method, but can effectively limit overfitting)

Regularization can also be defined as the penalty on a model's complexity.

Click the icon for additional notes.

Regularization is counterintuitive. Increasing regularization usually increases training loss, which is confusing because, well, isn't the goal to minimize training loss?

See Overfitting: Model complexity in Machine Learning Crash Course for more information.

regularization rate

#fundamentals

Click the icon to see the math.

The regularization rate is usually represented as the Greek letter lambda. The following simplified loss equation shows lambda's influence:

$$\text{minimize(loss function + }\lambda\text{(regularization))}$$

where regularization is any regularization mechanism, including;

L ₁ regularization
L ₂ regularization

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

ReLU

#fundamentals

Abbreviation for Rectified Linear Unit .

retrieval-augmented generation (RAG)

#fundamentals

Common motivations to use retrieval-augmented generation include:

Increasing the factual accuracy of a model's generated responses.
Giving the model access to knowledge it was not trained on.
Changing the knowledge that the model uses.
Enabling the model to cite sources.

For example, suppose that a chemistry app uses the PaLM API to generate summaries related to user queries. When the app's backend receives a query, the backend:

Searches for ("retrieves") data that's relevant to the user's query.
Appends ("augments") the relevant chemistry data to the user's query.
Instructs the LLM to create a summary based on the appended data.

ROC (receiver operating characteristic) Curve

#fundamentals

#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

A number line with 8 positive examples on the right side and
7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative classes
completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0)
to (1.0,1.0).

An ROC curve. The x-axis is False Positive Rate and the y-axis
is True Positive Rate. The ROC curve approximates a shaky arc
traversing the compass points from West to North.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.

Root Mean Squared Error (RMSE)

#fundamentals

#Metric

The square root of the Mean Squared Error .

اس

sigmoid function

#fundamentals

The sigmoid function has several uses in machine learning, including:

Converting the raw output of a logistic regression or multinomial regression model to a probability.
Acting as an activation function in some neural networks.

Click the icon to see the math.

The sigmoid function over an input number x has the following formula:

$$ sigmoid(x) = \frac{1}{1 + e^{-\text{x}}} $$

In machine learning, x is generally a weighted sum .

سافت مکس

#fundamentals

Image is a...	احتمال
سگ	.85
گربه	.13
اسب	.02

Softmax is also called full softmax .

Contrast with candidate sampling .

Click the icon to see the math.

The softmax equation is as follows:

$$\sigma_i = \frac{e^{\text{z}_i}} {\sum_{j=1}^{j=K} {e^{\text{z}_j}}} $$

کجا:

$\sigma_i$ is the output vector. Each element of the output vector specifies the probability of this element. The sum of all the elements in the output vector is 1.0. The output vector contains the same number of elements as the input vector, $z$.
$z$ is the input vector. Each element of the input vector contains a floating-point value.
$K$ is the number of elements in the input vector (and the output vector).

For example, suppose the input vector is:

[1.2, 2.5, 1.8]

Therefore, softmax calculates the denominator as follows:

$$\text{denominator} = e^{1.2} + e^{2.5} + e^{1.8} = 21.552$$

The softmax probability of each element is therefore:

$$\sigma_1 = \frac{e^{1.2}}{21.552} = 0.154 $$$$\sigma_2 = \frac{e^{2.5}}{21.552} = 0.565 $$$$\sigma_1 = \frac{e^{1.8}}{21.552} = 0.281 $$

So, the output vector is therefore:

$$\sigma = [0.154, 0.565, 0.281]$$

The sum of the three elements in $\sigma$ is 1.0. اوه!

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

sparse feature

#language

#fundamentals

sparse representation

#language

#fundamentals

Storing only the position(s) of nonzero elements in a sparse feature.

For example, suppose a categorical feature named species identifies the 36 tree species in a particular forest. Further assume that each example identifies only a single species.

A vector in which positions 0 through 23 hold the value 0, position
24 holds the value 1, and positions 25 through 35 hold the value 0.

Alternatively, sparse representation would simply identify the position of the particular species. If maple is at position 24, then the sparse representation of maple would simply be:

Notice that the sparse representation is much more compact than the one-hot representation.

Click the icon for a slightly more complex example.

جمله زیر را در نظر بگیرید:

My dog is a great dog

A sparse representation of the same sentence would simply be:

Click the icon if you are confused.

See Working with categorical data in Machine Learning Crash Course for more information.

sparse vector

#fundamentals

A vector whose values are mostly zeroes. See also sparse feature and sparsity .

squared loss

#fundamentals

#Metric

Synonym for L ₂ loss .

ایستا

#fundamentals

Something done once rather than continuously. The terms static and offline are synonyms. The following are common uses of static and offline in machine learning:

static model (or offline model ) is a model trained once and then used for a while.
static training (or offline training ) is the process of training a static model.
static inference (or offline inference ) is a process in which a model generates a batch of predictions at a time.

Contrast with dynamic .

static inference

#fundamentals

Synonym for offline inference .

stationarity

#fundamentals

A feature whose values don't change across one or more dimensions, usually time. For example, a feature whose values look about the same in 2021 and 2023 exhibits stationarity.

In the real world, very few features exhibit stationarity. Even features synonymous with stability (like sea level) change over time.

Contrast with nonstationarity .

stochastic gradient descent (SGD)

#fundamentals

A gradient descent algorithm in which the batch size is one. In other words, SGD trains on a single example chosen uniformly at random from a training set .

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

یادگیری ماشینی تحت نظارت

#fundamentals

Compare with unsupervised machine learning .

See Supervised Learning in the Introduction to ML course for more information.

synthetic feature

#fundamentals

A feature not present among the input features, but assembled from one or more of them. Methods for creating synthetic features include the following:

Bucketing a continuous feature into range bins.
Creating a feature cross .
Multiplying (or dividing) one feature value by other feature value(s) or by itself. For example, if a and b are input features, then the following are examples of synthetic features:
- ab
- یک ²
Applying a transcendental function to a feature value. For example, if c is an input feature, then the following are examples of synthetic features:
- sin(c)
- ln(c)

Features created by normalizing or scaling alone are not considered synthetic features.

تی

test loss

#fundamentals

#Metric

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate .

آموزش

#fundamentals

See Supervised Learning in the Introduction to ML course for more information.

از دست دادن آموزش

#fundamentals

#Metric

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

A downward slope implies that the model is improving.
An upward slope implies that the model is getting worse.
A flat slope implies that the model has reached convergence .

For example, the following somewhat idealized loss curve shows:

A steep downward slope during the initial iterations, which implies rapid model improvement.
A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace then during the initial iterations.
A flat slope towards the end of training, which suggests convergence.

The plot of training loss versus iterations. This loss curve starts
with a steep downward slope. The slope gradually flattens until the
slope becomes zero.

Although training loss is important, see also generalization .

training-serving skew

#fundamentals

The difference between a model's performance during training and that same model's performance during serving .

مجموعه آموزشی

#fundamentals

The subset of the dataset used to train a model .

Traditionally, examples in the dataset are divided into the following three distinct subsets:

یک مجموعه آموزشی
a validation set
a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

منفی واقعی (TN)

#fundamentals

#Metric

An example in which the model correctly predicts the negative class . For example, the model infers that a particular email message is not spam , and that email message really is not spam .

مثبت واقعی (TP)

#fundamentals

#Metric

An example in which the model correctly predicts the positive class . For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals

#Metric

Synonym for recall . یعنی:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve .

U

underfitting

#fundamentals

Producing a model with poor predictive ability because the model hasn't fully captured the complexity of the training data. Many problems can cause underfitting, including:

Training on the wrong set of features .
Training for too few epochs or at too low a learning rate .
Training with too high a regularization rate .
Providing too few hidden layers in a deep neural network.

See Overfitting in Machine Learning Crash Course for more information.

unlabeled example

#fundamentals

An example that contains features but no label . For example, the following table shows three unlabeled examples from a house valuation model, each with three features but no house value:

تعداد اتاق خواب	Number of bathrooms	House age
3	2	15
2	1	72
4	2	34

In supervised machine learning , models train on labeled examples and make predictions on unlabeled examples .

In semi-supervised and unsupervised learning, unlabeled examples are used during training.

Contrast unlabeled example with labeled example .

یادگیری ماشینی بدون نظارت

#clustering

#fundamentals

Training a model to find patterns in a dataset, typically an unlabeled dataset.

Contrast with supervised machine learning .

Click the icon for additional notes.

See What is Machine Learning? in the Introduction to ML course for more information.

V

اعتبار سنجی

#fundamentals

The initial evaluation of a model's quality. Validation checks the quality of a model's predictions against the validation set .

Because the validation set differs from the training set , validation helps guard against overfitting .

You might think of evaluating the model against the validation set as the first round of testing and evaluating the model against the test set as the second round of testing.

validation loss

#fundamentals

#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

See also generalization curve .

مجموعه اعتبار سنجی

#fundamentals

Traditionally, you divide the examples in the dataset into the following three distinct subsets:

یک مجموعه آموزشی
a validation set
a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

دبلیو

وزن

#fundamentals

A value that a model multiplies by another value. Training is the process of determining a model's ideal weights; inference is the process of using those learned weights to make predictions.

Click the icon to see an example of weights in a linear model.

Imagine a linear model with two features. Suppose that training determines the following weights (and bias ):

The bias, b, has a value of 2.2
The weight, w ₁ associated with one feature is 1.5.
The weight, w ₂ associated with the other feature is 0.4.

Now imagine an example with the following feature values:

The value of one feature, x ₁ , is 6.
The value of the other feature, x ₂ , is 10.

This linear model uses the following formula to generate a prediction, y':

$$y' = b + w_1x_1 + w_2x_2$$

Therefore, the prediction is:

$$y' = 2.2 + (1.5)(6) + (0.4)(10) = 15.2$$

If a weight is 0, then the corresponding feature doesn't contribute to the model. For example, if w ₁ is 0, then the value of x ₁ is irrelevant.

See Linear regression in Machine Learning Crash Course for more information.

weighted sum

#fundamentals

The sum of all the relevant input values multiplied by their corresponding weights. For example, suppose the relevant inputs consist of the following:

مقدار ورودی	input weight
2	-1.3
-1	0.6
3	0.4

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

A weighted sum is the input argument to an activation function .

ز

عادی سازی امتیاز Z

#fundamentals

ارزش خام	امتیاز Z
800	0
950	+1.5
575	-2.25

The machine learning model then trains on the Z-scores for that feature instead of on the raw values.

See Numerical data: Normalization in Machine Learning Crash Course for more information.

This page contains ML Fundamentals glossary terms. For all glossary terms, click here .

الف

دقت

#fundamentals

#Metric

The number of correct classification predictions divided by the total number of predictions. یعنی:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions }}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{\text{40}} {\text{40 + 10}} = \text{80%}$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions . So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

کجا:

TP is the number of true positives (correct predictions).
TN is the number of true negatives (correct predictions).
FP is the number of false positives (incorrect predictions).
FN is the number of false negatives (incorrect predictions).

Compare and contrast accuracy with precision and recall .

Click the icon for details about accuracy and class-imbalanced datasets.

دسته بندی	شماره
TP	0
TN	36499
FP	0
FN	25

The accuracy of this model is therefore:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (0 + 36499) / (0 + 36499 + 0 + 25) = 0.9993 = 99.93%

Although 99.93% accuracy seems like very a impressive percentage, the model actually has no predictive power.

Precision and recall are usually more useful metrics than accuracy for evaluating models trained on class-imbalanced datasets.

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

عملکرد فعال سازی

#fundamentals

A function that enables neural networks to learn nonlinear (complex) relationships between features and the label.

Popular activation functions include:

ReLU
سیگموئید

The plots of activation functions are never single straight lines. For example, the plot of the ReLU activation function consists of two straight lines:

A plot of the sigmoid activation function looks as follows:

Click the icon to see an example.

مقدار ورودی	input weight
2	-1.3
-1	0.6
3	0.4

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

See Neural networks: Activation functions in Machine Learning Crash Course for more information.

هوش مصنوعی

#fundamentals

AUC (Area under the ROC curve)

#fundamentals

#Metric

A number line with 8 positive examples on one side and
9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

AUC ignores any value you set for classification threshold . Instead, AUC considers all possible classification thresholds.

Click the icon to learn about the relationship between AUC and ROC curves.

AUC represents the area under an ROC curve . For example, the ROC curve for a model that perfectly separates positives from negatives looks as follows:

Conversely, the ROC curve for a classification model that can't separate classes at all is as follows. The area of this gray region is 0.5.

A more typical ROC curve looks approximately like the following:

It would be painstaking to calculate the area under this curve manually, which is why a program typically calculates most AUC values.

Click the icon for a more formal definition of AUC.

AUC is the probability that a classification model will be more confident than a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

ب

پس انتشار

#fundamentals

The algorithm that implements gradient descent in neural networks .

Training a neural network involves many iterations of the following two-pass cycle:

During the forward pass , the system processes a batch of examples to yield prediction(s). The system compares each prediction to each label value. The difference between the prediction and the label value is the loss for that example. The system aggregates the losses for all the examples to compute the total loss for the current batch.
During the backward pass (backpropagation), the system reduces loss by adjusting the weights of all the neurons in all the hidden layer(s) .

In calculus terms, backpropagation implements the chain rule . from calculus. That is, backpropagation calculates the partial derivative of the error with respect to each parameter.

Years ago, ML practitioners had to write code to implement backpropagation. Modern ML APIs like Keras now implement backpropagation for you. اوه!

See Neural networks in Machine Learning Crash Course for more information.

دسته ای

#fundamentals

The set of examples used in one training iteration . The batch size determines the number of examples in a batch.

See epoch for an explanation of how a batch relates to an epoch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

اندازه دسته

#fundamentals

The number of examples in a batch . For instance, if the batch size is 100, then the model processes 100 examples per iteration .

The following are popular batch size strategies:

Stochastic Gradient Descent (SGD) , in which the batch size is 1.
Full batch, in which the batch size is the number of examples in the entire training set . For instance, if the training set contains a million examples, then the batch size would be a million examples. Full batch is usually an inefficient strategy.
mini-batch in which the batch size is usually between 10 and 1000. Mini-batch is usually the most efficient strategy.

برای اطلاعات بیشتر به ادامه مطلب مراجعه کنید:

Production ML systems: Static versus dynamic inference in Machine Learning Crash Course.
Deep Learning Tuning Playbook .

bias (ethics/fairness)

#responsible

#fundamentals

2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:

Not to be confused with the bias term in machine learning models or prediction bias .

See Fairness: Types of bias in Machine Learning Crash Course for more information.

bias (math) or bias term

#fundamentals

An intercept or offset from an origin. Bias is a parameter in machine learning models, which is symbolized by either of the following:

ب
w ₀

For example, bias is the b in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

In a simple two-dimensional line, bias just means "y-intercept." For example, the bias of the line in the following illustration is 2.

The plot of a line with a slope of 0.5 and a bias (y-intercept) of 2.

Bias is not to be confused with bias in ethics and fairness or prediction bias .

See Linear Regression in Machine Learning Crash Course for more information.

طبقه بندی باینری

#fundamentals

A type of classification task that predicts one of two mutually exclusive classes:

طبقه مثبت
the negative class

For example, the following two machine learning models each perform binary classification:

A model that determines whether email messages are spam (the positive class) or not spam (the negative class).
A model that evaluates medical symptoms to determine whether a person has a particular disease (the positive class) or doesn't have that disease (the negative class).

Contrast with multi-class classification .

See Classification in Machine Learning Crash Course for more information.

سطل سازی

#fundamentals

Converting a single feature into multiple binary features called buckets or bins , typically based on a value range. The chopped feature is typically a continuous feature .

For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete buckets, such as:

<= 10 degrees Celsius would be the "cold" bucket.
11 - 24 degrees Celsius would be the "temperate" bucket.
>= 25 degrees Celsius would be the "warm" bucket.

The model will treat every value in the same bucket identically. For example, the values 13 and 22 are both in the temperate bucket, so the model treats the two values identically.

Click the icon for additional notes.

How do you know how many buckets to create, or what the ranges for each bucket should be? The answers typically require a fair amount of experimentation.

See Numerical data: Binning in Machine Learning Crash Course for more information.

سی

داده های طبقه بندی شده

#fundamentals

Features having a specific set of possible values. For example, consider a categorical feature named traffic-light-state , which can only have one of the following three possible values:

red
yellow
green

By representing traffic-light-state as a categorical feature, a model can learn the differing impacts of red , green , and yellow on driver behavior.

Categorical features are sometimes called discrete features .

Contrast with numerical data .

See Working with categorical data in Machine Learning Crash Course for more information.

کلاس

#fundamentals

A category that a label can belong to. به عنوان مثال:

In a binary classification model that detects spam, the two classes might be spam and not spam .
In a multi-class classification model that identifies dog breeds, the classes might be poodle , beagle , pug , and so on.

A classification model predicts a class. In contrast, a regression model predicts a number rather than a class.

See Classification in Machine Learning Crash Course for more information.

مدل طبقه بندی

#fundamentals

A model whose prediction is a class . For example, the following are all classification models:

A model that predicts an input sentence's language (French? Spanish? Italian?).
A model that predicts tree species (Maple? Oak? Baobab?).
A model that predicts the positive or negative class for a particular medical condition.

In contrast, regression models predict numbers rather than classes.

Two common types of classification models are:

طبقه بندی باینری
multi-class classification

classification threshold

#fundamentals

A logistic regression model outputs a raw value between 0 and 1. Then:

If this raw value is greater than the classification threshold, then the positive class is predicted.
If this raw value is less than the classification threshold, then the negative class is predicted.

For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class.

The choice of classification threshold strongly influences the number of false positives and false negatives .

Click the icon for additional notes.

For example, consider a binary classification disease prediction model. Suppose that when the system runs in the first year:

The raw value for a particular patient is 0.95.
The classification threshold is 0.94.

Therefore, the system diagnoses the positive class. (The patient gasps, "Oh no! I'm sick!")

A year later, perhaps the values now look as follows:

The raw value for the same patient remains at 0.95.
The classification threshold changes to 0.97.

Therefore, the system now reclassifies that patient as the negative class. ("Happy day! I'm not sick.") Same patient. Different diagnosis.

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

طبقه بندی کننده

#fundamentals

A casual term for a classification model .

class-imbalanced dataset

#fundamentals

1,000,000 negative labels
10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is not class-imbalanced because the ratio of negative labels to positive labels is relatively close to 1:

517 negative labels
483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is also class-imbalanced because one label has far more examples than the other two:

1,000,000 labels with class "green"
200 labels with class "purple"
350 labels with class "orange"

See also entropy , majority class , and minority class .

بریدن

#fundamentals

A technique for handling outliers by doing either or both of the following:

Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
Increasing feature values that are less than a minimum threshold up to that minimum threshold.

For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:

Clip all values over 60 (the maximum threshold) to be exactly 60.
Clip all values under 40 (the minimum threshold) to be exactly 40.

Gradient clipping forces gradient values within a designated range during training.

See Numerical data: Normalization in Machine Learning Crash Course for more information.

ماتریس سردرگمی

#fundamentals

	Tumor (predicted)	Non-Tumor (predicted)
Tumor (ground truth)	18 (TP)	1 (FN)
Non-Tumor (ground truth)	6 (FP)	452 (TN)

The preceding confusion matrix shows the following:

Of the 19 predictions in which ground truth was Tumor, the model correctly classified 18 and incorrectly classified 1.
Of the 458 predictions in which ground truth was Non-Tumor, the model correctly classified 452 and incorrectly classified 6.

	Setosa (predicted)	Versicolor (predicted)	Virginica (predicted)
Setosa (ground truth)	88	12	0
Versicolor (ground truth)	6	141	7
Virginica (ground truth)	2	27	109

As yet another example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or mistakenly predict 1 instead of 7.

Confusion matrixes contain sufficient information to calculate a variety of performance metrics, including precision and recall .

continuous feature

#fundamentals

A floating-point feature with an infinite range of possible values, such as temperature or weight.

Contrast with discrete feature .

همگرایی

#fundamentals

A state reached when loss values change very little or not at all with each iteration . For example, the following loss curve suggests convergence at around 700 iterations:

A model converges when additional training won't improve the model.

D

DataFrame

#fundamentals

A popular pandas data type for representing datasets in memory.

A DataFrame is analogous to a table or a spreadsheet. Each column of a DataFrame has a name (a header), and each row is identified by a unique number.

Each column in a DataFrame is structured like a 2D array, except that each column can be assigned its own data type.

See also the official pandas.DataFrame reference page .

data set or dataset

#fundamentals

A collection of raw data, commonly (but not exclusively) organized in one of the following formats:

a spreadsheet
a file in CSV (comma-separated values) format

deep model

#fundamentals

A neural network containing more than one hidden layer .

A deep model is also called a deep neural network .

Contrast with wide model .

dense feature

#fundamentals

A feature in which most or all values are nonzero, typically a Tensor of floating-point values. For example, the following 10-element Tensor is dense because 9 of its values are nonzero:

Contrast with sparse feature .

عمق

#fundamentals

The sum of the following in a neural network :

the number of hidden layers
the number of output layers , which is typically 1
the number of any embedding layers

For example, a neural network with five hidden layers and one output layer has a depth of 6.

Notice that the input layer doesn't influence depth.

discrete feature

#fundamentals

A feature with a finite set of possible values. For example, a feature whose values may only be animal , vegetable , or mineral is a discrete (or categorical) feature.

Contrast with continuous feature .

پویا

#fundamentals

Something done frequently or continuously. The terms dynamic and online are synonyms in machine learning. The following are common uses of dynamic and online in machine learning:

A dynamic model (or online model ) is a model that is retrained frequently or continuously.
Dynamic training (or online training ) is the process of training frequently or continuously.
Dynamic inference (or online inference ) is the process of generating predictions on demand.

dynamic model

#fundamentals

Contrast with static model .

E

توقف زودهنگام

#fundamentals

Click the icon for additional notes.

لایه جاسازی

#language

#fundamentals

An array of 73,000 elements. The first 6,232 elements hold the value 0. The next element holds the value 1. The final 66,767 elements hold مقدار صفر

In certain situations, hashing is a reasonable alternative to an embedding layer.

See Embeddings in Machine Learning Crash Course for more information.

دوران

#fundamentals

A full training pass over the entire training set such that each example has been processed once.

An epoch represents N / batch size training iterations , where N is the total number of examples.

For instance, suppose the following:

The dataset consists of 1,000 examples.
The batch size is 50 examples.

Therefore, a single epoch requires 20 iterations:

1 epoch = (N/batch size) = (1,000 / 50) = 20 iterations

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

مثال

#fundamentals

The values of one row of features and possibly a label . Examples in supervised learning fall into two general categories:

A labeled example consists of one or more features and a label. Labeled examples are used during training.
An unlabeled example consists of one or more features but no label. Unlabeled examples are used during inference.

For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. Here are three labeled examples:

ویژگی ها			برچسب بزنید
دما	رطوبت	فشار	نمره آزمون
15	47	998	خوب
19	34	1020	عالی
18	92	1012	بیچاره

Here are three unlabeled examples:

دما	رطوبت	فشار
12	62	1014
21	47	1017
19	41	1021

See Supervised Learning in the Introduction to Machine Learning course for more information.

اف

منفی کاذب (FN)

#fundamentals

#Metric

مثبت کاذب (FP)

#fundamentals

#Metric

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals

#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve .

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

ویژگی

#fundamentals

ویژگی ها			برچسب بزنید
دما	رطوبت	فشار	نمره آزمون
15	47	998	92
19	34	1020	84
18	92	1012	87

Contrast with label .

See Supervised Learning in the Introduction to Machine Learning course for more information.

feature cross

#fundamentals

A synthetic feature formed by "crossing" categorical or bucketed features.

For example, consider a "mood forecasting" model that represents temperature in one of the following four buckets:

freezing
chilly
temperate
warm

And represents wind speed in one of the following three buckets:

still
light
windy

Alternatively, you could create a feature cross of temperature and wind speed. This synthetic feature would have the following 12 possible values:

freezing-still
freezing-light
freezing-windy
chilly-still
chilly-light
chilly-windy
temperate-still
temperate-light
temperate-windy
warm-still
warm-light
warm-windy

Thanks to feature crosses, the model can learn mood differences between a freezing-windy day and a freezing-still day.

Formally, a cross is a Cartesian product .

Feature crosses are mostly used with linear models and are rarely used with neural networks.

See Categorical data: Feature crosses in Machine Learning Crash Course for more information.

مهندسی ویژگی

#fundamentals

#TensorFlow

A process that involves the following steps:

Determining which features might be useful in training a model.
Converting raw data from the dataset into efficient versions of those features.

For example, you might determine that temperature might be a useful feature. Then, you might experiment with bucketing to optimize what the model can learn from different temperature ranges.

Feature engineering is sometimes called feature extraction or featurization .

Click the icon for additional notes about TensorFlow.

In TensorFlow, feature engineering often means converting raw log file entries to tf.Example protocol buffers. See also tf.Transform .

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

مجموعه ویژگی

#fundamentals

بردار ویژگی

#fundamentals

[0.92, 0.56]

Four layers: an input layer, two hidden layers, and one output layer.
The input layer contains two nodes, one containing the value
0.92 and the other containing the value 0.56.

Each example supplies different values for the feature vector, so the feature vector for the next example could be something like:

[0.73, 0.49]

[0.0, 0.0, 1.0, 0.0, 0.0]

As another example, suppose your model consists of three features:

a binary categorical feature with five possible values represented with one-hot encoding; for example: [0.0, 1.0, 0.0, 0.0, 0.0]
another binary categorical feature with three possible values represented with one-hot encoding; for example: [0.0, 0.0, 1.0]
a floating-point feature; for example: 8.3 .

In this case, the feature vector for each example would be represented by nine values. Given the example values in the preceding list, the feature vector would be:

0.0
1.0
0.0
0.0
0.0
0.0
0.0
1.0
8.3

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

حلقه بازخورد

#fundamentals

See Production ML systems: Questions to ask in Machine Learning Crash Course for more information.

جی

تعمیم

#fundamentals

A model's ability to make correct predictions on new, previously unseen data. A model that can generalize is the opposite of a model that is overfitting .

Click the icon for additional notes.

To encourage generalization, regularization helps a model train less exactly to the peculiarities of the data in the training set.

See Generalization in Machine Learning Crash Course for more information.

generalization curve

#fundamentals

A plot of both training loss and validation loss as a function of the number of iterations .

See Generalization in Machine Learning Crash Course for more information.

شیب نزول

#fundamentals

A mathematical technique to minimize loss . Gradient descent iteratively adjusts weights and biases , gradually finding the best combination to minimize loss.

Gradient descent is older—much, much older—than machine learning.

See the Linear regression: Gradient descent in Machine Learning Crash Course for more information.

حقیقت زمین

#fundamentals

واقعیت.

The thing that actually happened.

Click the icon for additional notes.

We assess model quality against ground truth. However, ground truth is not always completely, well, truthful. For example, consider the following examples of potential imperfections in ground truth:

In the graduation example, are we certain that the graduation records for each student are always correct? Is the university's record-keeping flawless?
Suppose the label is a floating-point value measured by instruments (for example, barometers). How can we be sure that each instrument is calibrated identically or that each reading was taken under the same circumstances?
If the label is a matter of human opinion, how can we be sure that each human rater is evaluating events in the same way? To improve consistency, expert human raters sometimes intervene.

اچ

لایه پنهان

#fundamentals

A deep neural network contains more than one hidden layer. For example, the preceding illustration is a deep neural network because the model contains two hidden layers.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course for more information.

هایپرپارامتر

#fundamentals

In contrast, parameters are the various weights and bias that the model learns during training.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

من

independently and identically distributed (iid)

#fundamentals

استنتاج

#fundamentals

In machine learning, the process of making predictions by applying a trained model to unlabeled examples .

Inference has a somewhat different meaning in statistics. See the Wikipedia article on statistical inference for details.

See Supervised Learning in the Intro to ML course to see inference's role in a supervised learning system.

لایه ورودی

#fundamentals

Four layers: an input layer, two hidden layers, and an output layer.

تفسیر پذیری

#fundamentals

The ability to explain or to present an ML model's reasoning in understandable terms to a human.

You can use the Learning Interpretability Tool (LIT) to interpret ML models.

تکرار

#fundamentals

When training a neural network , a single iteration involves the following two passes:

A forward pass to evaluate loss on a single batch.
A backward pass ( backpropagation ) to adjust the model's parameters based on the loss and the learning rate.

See Gradient descent in Machine Learning Crash Course for more information.

L

L ₀ regularization

#fundamentals

L ₀ regularization is sometimes called L0-norm regularization .

Click the icon for additional notes.

L ₀ regularization is generally impractical in large models because L ₀ regularization turns training into a convex optimization problem.

L ₁ loss

#fundamentals

#Metric

Actual value of example	Model's predicted value	Absolute value of delta
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L ₁ loss

L ₁ loss is less sensitive to outliers than L ₂ loss .

The Mean Absolute Error is the average L ₁ loss per example.

Click the icon to see the formal math.

$$ L_1 loss = \sum_{i=0}^n | y_i - \hat{y}_i |$$

کجا:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

See Linear regression: Loss in Machine Learning Crash Course for more information.

L ₁ regularization

#fundamentals

Contrast with L ₂ regularization .

L ₂ loss

#fundamentals

#Metric

Actual value of example	Model's predicted value	Square of delta
7	6	1
5	4	1
8	11	9
4	6	4
9	8	1
		16 = L ₂ loss

Regression models typically use L ₂ loss as the loss function.

The Mean Squared Error is the average L ₂ loss per example. Squared loss is another name for L ₂ loss.

Click the icon to see the formal math.

$$ L_2 loss = \sum_{i=0}^n {(y_i - \hat{y}_i)}^2$$

کجا:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

L ₂ regularization

#fundamentals

L ₂ regularization always improves generalization in linear models .

Contrast with L ₁ regularization .

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

برچسب

#fundamentals

In supervised machine learning , the "answer" or "result" portion of an example .

See Supervised Learning in Introduction to Machine Learning for more information.

labeled example

#fundamentals

An example that contains one or more features and a label . For example, the following table shows three labeled examples from a house valuation model, each with three features and one label:

تعداد اتاق خواب	Number of bathrooms	House age	House price (label)
3	2	15	345000 دلار
2	1	72	179000 دلار
4	2	34	392000 دلار

In supervised machine learning , models train on labeled examples and make predictions on unlabeled examples .

Contrast labeled example with unlabeled examples.

See Supervised Learning in Introduction to Machine Learning for more information.

لامبدا

#fundamentals

Synonym for regularization rate .

Lambda is an overloaded term. Here we're focusing on the term's definition within regularization .

لایه

#fundamentals

A set of neurons in a neural network . Three common types of layers are as follows:

The input layer , which provides values for all the features .
One or more hidden layers , which find nonlinear relationships between the features and the label.
The output layer , which provides the prediction.

For example, the following illustration shows a neural network with one input layer, two hidden layers, and one output layer:

In TensorFlow , layers are also Python functions that take Tensors and configuration options as input and produce other tensors as output.

میزان یادگیری

#fundamentals

Click the icon for a more mathematical explanation.

During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step .

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

خطی

#fundamentals

A relationship between two or more variables that can be represented solely through addition and multiplication.

The plot of a linear relationship is a line.

Contrast with nonlinear .

مدل خطی

#fundamentals

Linear models are usually easier to train and more interpretable than deep models. However, deep models can learn complex relationships between features.

Linear regression and logistic regression are two types of linear models.

Click the icon to see the math.

A linear model follows this formula:

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

کجا:

y' is the raw prediction. (In certain kinds of linear models, this raw prediction will be further modified. For example, see logistic regression .)
b is the bias .
w is a weight , so w ₁ is the weight of the first feature, w ₂ is the weight of the second feature, and so on.
x is a feature , so x ₁ is the value of the first feature, x ₂ is the value of the second feature, and so on.

For example, suppose a linear model for three features learns the following bias and weights:

b = 7
w ₁ = -2.5
w ₂ = -1.2
w ₃ = 1.4

Therefore, given three features (x ₁ , x ₂ , and x ₃ ), the linear model uses the following equation to generate each prediction:

y' = 7 + (-2.5)(x₁) + (-1.2)(x₂) + (1.4)(x₃)

Suppose a particular example contains the following values:

x ₁ = 4
x ₂ = -10
x ₃ = 5

Plugging those values into the formula yields a prediction for this example:

y' = 7 + (-2.5)(4) + (-1.2)(-10) + (1.4)(5)
y' = 16

رگرسیون خطی

#fundamentals

A type of machine learning model in which both of the following are true:

The model is a linear model .
The prediction is a floating-point value. (This is the regression part of linear regression .)

Contrast linear regression with logistic regression . Also, contrast regression with classification .

See Linear regression in Machine Learning Crash Course for more information.

رگرسیون لجستیک

#fundamentals

A type of regression model that predicts a probability. Logistic regression models have the following characteristics:

The label is categorical . The term logistic regression usually refers to binary logistic regression , that is, to a model that calculates probabilities for labels with two possible values. A less common variant, multinomial logistic regression , calculates probabilities for labels with more than two possible values.
The loss function during training is Log Loss . (Multiple Log Loss units can be placed in parallel for labels with more than two possible values.)
The model has a linear architecture, not a deep neural network. However, the remainder of this definition also applies to deep models that predict probabilities for categorical labels.

A 72% chance of the email being spam.
A 28% chance of the email not being spam.

A logistic regression model uses the following two-step architecture:

The model generates a raw prediction (y') by applying a linear function of input features.
The model uses that raw prediction as input to a sigmoid function , which converts the raw prediction to a value between 0 and 1, exclusive.

Like any regression model, a logistic regression model predicts a number. However, this number typically becomes part of a binary classification model as follows:

If the predicted number is greater than the classification threshold , the binary classification model predicts the positive class.
If the predicted number is less than the classification threshold, the binary classification model predicts the negative class.

See Logistic regression in Machine Learning Crash Course for more information.

از دست دادن گزارش

#fundamentals

The loss function used in binary logistic regression .

Click the icon to see the math.

The following formula calculates Log Loss:

$$\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')$$

کجا:

$(x,y)\in D$ is the dataset containing many labeled examples, which are $(x,y)$ جفت
$y$ is the label in a labeled example. Since this is logistic regression, every value of $y$ must either be 0 or 1.
$y'$ is the predicted value (somewhere between 0 and 1, exclusive), given the set of features in $x$.

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

log-odds

#fundamentals

The logarithm of the odds of some event.

Click the icon to see the math.

$$ {\text{odds}} = \frac{\text{p}} {\text{(1-p)}} = \frac{.9} {.1} = {\text{9}} $$

$$ {\text{log-odds}} = ln(9) ~= 2.2 $$

The log-odds function is the inverse of the sigmoid function .

از دست دادن

#fundamentals

#Metric

During the training of a supervised model , a measure of how far a model's prediction is from its label .

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss curve

#fundamentals

A plot of loss as a function of the number of training iterations . The following plot shows a typical loss curve:

A Cartesian graph of loss versus training iterations, showing a
rapid drop in loss for the initial iterations, followed by a gradual
drop, and then a flat slope during the final iterations.

Loss curves can help you determine when your model is converging or overfitting .

Loss curves can plot all of the following types of loss:

از دست دادن آموزش
validation loss
test loss

See also generalization curve .

See Overfitting: Interpreting loss curves in Machine Learning Crash Course for more information.

عملکرد از دست دادن

#fundamentals

#Metric

The goal of training is typically to minimize the loss that a loss function returns.

Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. به عنوان مثال:

L ₂ loss (or Mean Squared Error ) is the loss function for linear regression .
Log Loss is the loss function for logistic regression .

م

یادگیری ماشینی

#fundamentals

Machine learning also refers to the field of study concerned with these programs or systems.

See the Introduction to Machine Learning course for more information.

majority class

#fundamentals

The more common label in a class-imbalanced dataset . For example, given a dataset containing 99% negative labels and 1% positive labels, the negative labels are the majority class.

Contrast with minority class .

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

mini-batch

#fundamentals

A small, randomly selected subset of a batch processed in one iteration . The batch size of a mini-batch is usually between 10 and 1,000 examples.

It is much more efficient to calculate the loss on a mini-batch than the loss on all the examples in the full batch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

minority class

#fundamentals

The less common label in a class-imbalanced dataset . For example, given a dataset containing 99% negative labels and 1% positive labels, the positive labels are the minority class.

Contrast with majority class .

Click the icon for additional notes.

If your dataset doesn't contain enough minority class examples, consider using downsampling (the definition in the second bullet) to supplement the minority class.

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

مدل

#fundamentals

A linear regression model consists of a set of weights and a bias .
A neural network model consists of:
- A set of hidden layers , each containing one or more neurons .
- The weights and bias associated with each neuron.
A decision tree model consists of:
- The shape of the tree; that is, the pattern in which the conditions and leaves are connected.
- The conditions and leaves.

You can save, restore, or make copies of a model.

Unsupervised machine learning also generates models, typically a function that can map an input example to the most appropriate cluster .

Click the icon to compare algebraic and programming functions to ML models.

An algebraic function such as the following is a model:

  f(x, y) = 3x -5xy + y² + 17

The preceding function maps input values ( x and y ) to output.

Similarly, a programming function like the following is also a model:

def half_of_greater(x, y):
  if (x > y):
    return(x / 2)
  else
    return(y / 2)

A caller passes arguments to the preceding Python function, and the Python function generates output (via the return statement).

A human programmer codes a programming function manually. In contrast, a machine learning model gradually learns the optimal parameters during automated training.

multi-class classification

#fundamentals

زنبق ستوزا
زنبق ویرجینیکا
زنبق ورسیکالر

A model trained on the Iris dataset that predicts Iris type on new examples is performing multi-class classification.

In clustering problems, multi-class classification refers to more than two clusters.

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

ن

negative class

#fundamentals

#Metric

The negative class in a medical test might be "not tumor."
The negative class in an email classification model might be "not spam."

Contrast with positive class .

شبکه عصبی

#fundamentals

A neural network with an input layer, two hidden layers, and an لایه خروجی

Neural networks implemented on computers are sometimes called artificial neural networks to differentiate them from neural networks found in brains and other nervous systems.

Some neural networks can mimic extremely complex nonlinear relationships between different features and the label.

See Neural networks in Machine Learning Crash Course for more information.

نورون

#fundamentals

In machine learning, a distinct unit within a hidden layer of a neural network . Each neuron performs the following two-step action:

Calculates the weighted sum of input values multiplied by their corresponding weights.
Passes the weighted sum as input to an activation function .

The following illustration highlights two neurons and their inputs.

A neuron in a neural network mimics the behavior of neurons in brains and other parts of nervous systems.

node (neural network)

#fundamentals

A neuron in a hidden layer .

See Neural Networks in Machine Learning Crash Course for more information.

غیر خطی

#fundamentals

دو قطعه One plot is a line, so this is a linear relationship. The other plot is a curve, so this is a nonlinear relationship.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course to experiment with different kinds of nonlinear functions.

nonstationarity

#fundamentals

A feature whose values change across one or more dimensions, usually time. For example, consider the following examples of nonstationarity:

The number of swimsuits sold at a particular store varies with the season.
The quantity of a particular fruit harvested in a particular region is zero for much of the year but large for a brief period.
Due to climate change, annual mean temperatures are shifting.

Contrast with stationarity .

عادی سازی

#fundamentals

Broadly speaking, the process of converting a variable's actual range of values into a standard range of values, such as:

-1 to +1
0 به 1
Z-scores (roughly, -3 to +3)

See also Z-score normalization .

See Numerical Data: Normalization in Machine Learning Crash Course for more information.

داده های عددی

#fundamentals

Numerical features are sometimes called continuous features .

See Working with numerical data in Machine Learning Crash Course for more information.

O

آفلاین

#fundamentals

Synonym for static .

offline inference

#fundamentals

The process of a model generating a batch of predictions and then caching (saving) those predictions. Apps can then access the inferred prediction from the cache rather than rerunning the model.

Offline inference is also called static inference .

Contrast with online inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

one-hot encoding

#fundamentals

Representing categorical data as a vector in which:

One element is set to 1.
All other elements are set to 0.

"Denmark"
"سوئد"
"Norway"
"Finland"
"Iceland"

One-hot encoding could represent each of the five values as follows:

کشور	بردار
"Denmark"	1	0	0	0	0
"سوئد"	0	1	0	0	0
"Norway"	0	0	1	0	0
"Finland"	0	0	0	1	0
"Iceland"	0	0	0	0	1

Thanks to one-hot encoding, a model can learn different connections based on each of the five countries.

"Denmark" is 0
"Sweden" is 1
"Norway" is 2
"Finland" is 3
"Iceland" is 4

See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.

one-vs.-all

#fundamentals

animal versus not animal
vegetable versus not vegetable
mineral versus not mineral

آنلاین

#fundamentals

Synonym for dynamic .

online inference

#fundamentals

Contrast with offline inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

output layer

#fundamentals

The "final" layer of a neural network. The output layer contains the prediction.

The following illustration shows a small deep neural network with an input layer, two hidden layers, and an output layer:

بیش از حد

#fundamentals

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

Regularization can reduce overfitting. Training on a large and diverse training set can also reduce overfitting.

Click the icon for additional notes.

See Overfitting in Machine Learning Crash Course for more information.

پ

پانداها

#fundamentals

پارامتر

#fundamentals

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

In contrast, hyperparameters are the values that you (or a hyperparameter tuning service) supply to the model. For example, learning rate is a hyperparameter.

positive class

#fundamentals

#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class .

Click the icon for additional notes.

Admittedly, you're simultaneously testing for both the positive and negative classes.

پس پردازش

#responsible

#fundamentals

Adjusting the output of a model after the model has been run. Post-processing can be used to enforce fairness constraints without modifying models themselves.

پیش بینی

#fundamentals

A model's output. به عنوان مثال:

The prediction of a binary classification model is either the positive class or the negative class.
The prediction of a multi-class classification model is one class.
The prediction of a linear regression model is a number.

proxy labels

#fundamentals

Data used to approximate labels not directly available in a dataset.

See Datasets: Labels in Machine Learning Crash Course for more information.

آر

RAG

#fundamentals

Abbreviation for retrieval-augmented generation .

ارزیاب

#fundamentals

A human who provides labels for examples . "Annotator" is another name for rater.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

واحد خطی اصلاح شده (ReLU)

#fundamentals

An activation function with the following behavior:

If input is negative or zero, then the output is 0.
If input is positive, then the output is equal to the input.

به عنوان مثال:

If the input is -3, then the output is 0.
If the input is +3, then the output is 3.0.

Here is a plot of ReLU:

ReLU is a very popular activation function. Despite its simple behavior, ReLU still enables a neural network to learn nonlinear relationships between features and the label .

مدل رگرسیون

#fundamentals

Informally, a model that generates a numerical prediction. (In contrast, a classification model generates a class prediction.) For example, the following are all regression models:

A model that predicts a certain house's value in Euros, such as 423,000.
A model that predicts a certain tree's life expectancy in years, such as 23.2.
A model that predicts the amount of rain in inches that will fall in a certain city over the next six hours, such as 0.18.

Two common types of regression models are:

Linear regression , which finds the line that best fits label values to features.
Logistic regression , which generates a probability between 0.0 and 1.0 that a system typically then maps to a class prediction.

منظم سازی

#fundamentals

Any mechanism that reduces overfitting . Popular types of regularization include:

L ₁ regularization
L ₂ regularization
تنظیم ترک تحصیل
early stopping (this is not a formal regularization method, but can effectively limit overfitting)

Regularization can also be defined as the penalty on a model's complexity.

Click the icon for additional notes.

Regularization is counterintuitive. Increasing regularization usually increases training loss, which is confusing because, well, isn't the goal to minimize training loss?

See Overfitting: Model complexity in Machine Learning Crash Course for more information.

regularization rate

#fundamentals

Click the icon to see the math.

The regularization rate is usually represented as the Greek letter lambda. The following simplified loss equation shows lambda's influence:

$$\text{minimize(loss function + }\lambda\text{(regularization))}$$

where regularization is any regularization mechanism, including;

L ₁ regularization
L ₂ regularization

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

ReLU

#fundamentals

Abbreviation for Rectified Linear Unit .

retrieval-augmented generation (RAG)

#fundamentals

Common motivations to use retrieval-augmented generation include:

Increasing the factual accuracy of a model's generated responses.
Giving the model access to knowledge it was not trained on.
Changing the knowledge that the model uses.
Enabling the model to cite sources.

For example, suppose that a chemistry app uses the PaLM API to generate summaries related to user queries. When the app's backend receives a query, the backend:

Searches for ("retrieves") data that's relevant to the user's query.
Appends ("augments") the relevant chemistry data to the user's query.
Instructs the LLM to create a summary based on the appended data.

ROC (receiver operating characteristic) Curve

#fundamentals

#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

A number line with 8 positive examples on the right side and
7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative classes
completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0)
to (1.0,1.0).

An ROC curve. The x-axis is False Positive Rate and the y-axis
is True Positive Rate. The ROC curve approximates a shaky arc
traversing the compass points from West to North.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.

Root Mean Squared Error (RMSE)

#fundamentals

#Metric

The square root of the Mean Squared Error .

اس

sigmoid function

#fundamentals

The sigmoid function has several uses in machine learning, including:

Converting the raw output of a logistic regression or multinomial regression model to a probability.
Acting as an activation function in some neural networks.

Click the icon to see the math.

The sigmoid function over an input number x has the following formula:

$$ sigmoid(x) = \frac{1}{1 + e^{-\text{x}}} $$

In machine learning, x is generally a weighted sum .

سافت مکس

#fundamentals

Image is a...	احتمال
سگ	.85
گربه	.13
اسب	.02

Softmax is also called full softmax .

Contrast with candidate sampling .

Click the icon to see the math.

The softmax equation is as follows:

$$\sigma_i = \frac{e^{\text{z}_i}} {\sum_{j=1}^{j=K} {e^{\text{z}_j}}} $$

کجا:

$\sigma_i$ is the output vector. Each element of the output vector specifies the probability of this element. The sum of all the elements in the output vector is 1.0. The output vector contains the same number of elements as the input vector, $z$.
$z$ is the input vector. Each element of the input vector contains a floating-point value.
$K$ is the number of elements in the input vector (and the output vector).

For example, suppose the input vector is:

[1.2, 2.5, 1.8]

Therefore, softmax calculates the denominator as follows:

$$\text{denominator} = e^{1.2} + e^{2.5} + e^{1.8} = 21.552$$

The softmax probability of each element is therefore:

$$\sigma_1 = \frac{e^{1.2}}{21.552} = 0.154 $$$$\sigma_2 = \frac{e^{2.5}}{21.552} = 0.565 $$$$\sigma_1 = \frac{e^{1.8}}{21.552} = 0.281 $$

So, the output vector is therefore:

$$\sigma = [0.154, 0.565, 0.281]$$

The sum of the three elements in $\sigma$ is 1.0. اوه!

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

sparse feature

#language

#fundamentals

sparse representation

#language

#fundamentals

Storing only the position(s) of nonzero elements in a sparse feature.

For example, suppose a categorical feature named species identifies the 36 tree species in a particular forest. Further assume that each example identifies only a single species.

A vector in which positions 0 through 23 hold the value 0, position
24 holds the value 1, and positions 25 through 35 hold the value 0.

Alternatively, sparse representation would simply identify the position of the particular species. If maple is at position 24, then the sparse representation of maple would simply be:

Notice that the sparse representation is much more compact than the one-hot representation.

Click the icon for a slightly more complex example.

جمله زیر را در نظر بگیرید:

My dog is a great dog

A sparse representation of the same sentence would simply be:

Click the icon if you are confused.

See Working with categorical data in Machine Learning Crash Course for more information.

sparse vector

#fundamentals

A vector whose values are mostly zeroes. See also sparse feature and sparsity .

squared loss

#fundamentals

#Metric

Synonym for L ₂ loss .

ایستا

#fundamentals

Something done once rather than continuously. The terms static and offline are synonyms. The following are common uses of static and offline in machine learning:

static model (or offline model ) is a model trained once and then used for a while.
static training (or offline training ) is the process of training a static model.
static inference (or offline inference ) is a process in which a model generates a batch of predictions at a time.

Contrast with dynamic .

static inference

#fundamentals

Synonym for offline inference .

stationarity

#fundamentals

A feature whose values don't change across one or more dimensions, usually time. For example, a feature whose values look about the same in 2021 and 2023 exhibits stationarity.

In the real world, very few features exhibit stationarity. Even features synonymous with stability (like sea level) change over time.

Contrast with nonstationarity .

stochastic gradient descent (SGD)

#fundamentals

A gradient descent algorithm in which the batch size is one. In other words, SGD trains on a single example chosen uniformly at random from a training set .

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

یادگیری ماشینی تحت نظارت

#fundamentals

Compare with unsupervised machine learning .

See Supervised Learning in the Introduction to ML course for more information.

synthetic feature

#fundamentals

A feature not present among the input features, but assembled from one or more of them. Methods for creating synthetic features include the following:

Bucketing a continuous feature into range bins.
Creating a feature cross .
Multiplying (or dividing) one feature value by other feature value(s) or by itself. For example, if a and b are input features, then the following are examples of synthetic features:
- ab
- یک ²
Applying a transcendental function to a feature value. For example, if c is an input feature, then the following are examples of synthetic features:
- sin(c)
- ln(c)

Features created by normalizing or scaling alone are not considered synthetic features.

تی

test loss

#fundamentals

#Metric

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate .

آموزش

#fundamentals

See Supervised Learning in the Introduction to ML course for more information.

از دست دادن آموزش

#fundamentals

#Metric

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

A downward slope implies that the model is improving.
An upward slope implies that the model is getting worse.
A flat slope implies that the model has reached convergence .

For example, the following somewhat idealized loss curve shows:

A steep downward slope during the initial iterations, which implies rapid model improvement.
A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace then during the initial iterations.
A flat slope towards the end of training, which suggests convergence.

The plot of training loss versus iterations. This loss curve starts
with a steep downward slope. The slope gradually flattens until the
slope becomes zero.

Although training loss is important, see also generalization .

training-serving skew

#fundamentals

The difference between a model's performance during training and that same model's performance during serving .

مجموعه آموزشی

#fundamentals

The subset of the dataset used to train a model .

Traditionally, examples in the dataset are divided into the following three distinct subsets:

یک مجموعه آموزشی
a validation set
a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

منفی واقعی (TN)

#fundamentals

#Metric

An example in which the model correctly predicts the negative class . For example, the model infers that a particular email message is not spam , and that email message really is not spam .

مثبت واقعی (TP)

#fundamentals

#Metric

An example in which the model correctly predicts the positive class . For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals

#Metric

Synonym for recall . یعنی:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve .

U

underfitting

#fundamentals

Producing a model with poor predictive ability because the model hasn't fully captured the complexity of the training data. Many problems can cause underfitting, including:

Training on the wrong set of features .
Training for too few epochs or at too low a learning rate .
Training with too high a regularization rate .
Providing too few hidden layers in a deep neural network.

See Overfitting in Machine Learning Crash Course for more information.

unlabeled example

#fundamentals

An example that contains features but no label . For example, the following table shows three unlabeled examples from a house valuation model, each with three features but no house value:

تعداد اتاق خواب	Number of bathrooms	House age
3	2	15
2	1	72
4	2	34

In supervised machine learning , models train on labeled examples and make predictions on unlabeled examples .

In semi-supervised and unsupervised learning, unlabeled examples are used during training.

Contrast unlabeled example with labeled example .

یادگیری ماشینی بدون نظارت

#clustering

#fundamentals

Training a model to find patterns in a dataset, typically an unlabeled dataset.

Contrast with supervised machine learning .

Click the icon for additional notes.

See What is Machine Learning? in the Introduction to ML course for more information.

V

اعتبار سنجی

#fundamentals

The initial evaluation of a model's quality. Validation checks the quality of a model's predictions against the validation set .

Because the validation set differs from the training set , validation helps guard against overfitting .

You might think of evaluating the model against the validation set as the first round of testing and evaluating the model against the test set as the second round of testing.

validation loss

#fundamentals

#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

See also generalization curve .

مجموعه اعتبار سنجی

#fundamentals

Traditionally, you divide the examples in the dataset into the following three distinct subsets:

یک مجموعه آموزشی
a validation set
a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

دبلیو

وزن

#fundamentals

A value that a model multiplies by another value. Training is the process of determining a model's ideal weights; inference is the process of using those learned weights to make predictions.

Click the icon to see an example of weights in a linear model.

Imagine a linear model with two features. Suppose that training determines the following weights (and bias ):

The bias, b, has a value of 2.2
The weight, w ₁ associated with one feature is 1.5.
The weight, w ₂ associated with the other feature is 0.4.

Now imagine an example with the following feature values:

The value of one feature, x ₁ , is 6.
The value of the other feature, x ₂ , is 10.

This linear model uses the following formula to generate a prediction, y':

$$y' = b + w_1x_1 + w_2x_2$$

Therefore, the prediction is:

$$y' = 2.2 + (1.5)(6) + (0.4)(10) = 15.2$$

If a weight is 0, then the corresponding feature doesn't contribute to the model. For example, if w ₁ is 0, then the value of x ₁ is irrelevant.

See Linear regression in Machine Learning Crash Course for more information.

weighted sum

#fundamentals

The sum of all the relevant input values multiplied by their corresponding weights. For example, suppose the relevant inputs consist of the following:

مقدار ورودی	input weight
2	-1.3
-1	0.6
3	0.4

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

A weighted sum is the input argument to an activation function .

ز

عادی سازی امتیاز Z

#fundamentals

ارزش خام	امتیاز Z
800	0
950	+1.5
575	-2.25

The machine learning model then trains on the Z-scores for that feature instead of on the raw values.

See Numerical data: Normalization in Machine Learning Crash Course for more information.

This page contains ML Fundamentals glossary terms. For all glossary terms, click here .

الف

دقت

#fundamentals

#Metric

The number of correct classification predictions divided by the total number of predictions. یعنی:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions }}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{\text{40}} {\text{40 + 10}} = \text{80%}$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions . So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

کجا:

TP is the number of true positives (correct predictions).
TN is the number of true negatives (correct predictions).
FP is the number of false positives (incorrect predictions).
FN is the number of false negatives (incorrect predictions).

Compare and contrast accuracy with precision and recall .

Click the icon for details about accuracy and class-imbalanced datasets.

دسته بندی	شماره
TP	0
TN	36499
FP	0
FN	25

The accuracy of this model is therefore:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (0 + 36499) / (0 + 36499 + 0 + 25) = 0.9993 = 99.93%

Although 99.93% accuracy seems like very a impressive percentage, the model actually has no predictive power.

Precision and recall are usually more useful metrics than accuracy for evaluating models trained on class-imbalanced datasets.

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

عملکرد فعال سازی

#fundamentals

A function that enables neural networks to learn nonlinear (complex) relationships between features and the label.

Popular activation functions include:

ReLU
سیگموئید

The plots of activation functions are never single straight lines. For example, the plot of the ReLU activation function consists of two straight lines:

A plot of the sigmoid activation function looks as follows:

Click the icon to see an example.

مقدار ورودی	input weight
2	-1.3
-1	0.6
3	0.4

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

See Neural networks: Activation functions in Machine Learning Crash Course for more information.

هوش مصنوعی

#fundamentals

AUC (Area under the ROC curve)

#fundamentals

#Metric

A number line with 8 positive examples on one side and
9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

AUC ignores any value you set for classification threshold . Instead, AUC considers all possible classification thresholds.

Click the icon to learn about the relationship between AUC and ROC curves.

AUC represents the area under an ROC curve . For example, the ROC curve for a model that perfectly separates positives from negatives looks as follows:

Conversely, the ROC curve for a classification model that can't separate classes at all is as follows. The area of this gray region is 0.5.

A more typical ROC curve looks approximately like the following:

It would be painstaking to calculate the area under this curve manually, which is why a program typically calculates most AUC values.

Click the icon for a more formal definition of AUC.

AUC is the probability that a classification model will be more confident than a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

ب

پس انتشار

#fundamentals

The algorithm that implements gradient descent in neural networks .

Training a neural network involves many iterations of the following two-pass cycle:

During the forward pass , the system processes a batch of examples to yield prediction(s). The system compares each prediction to each label value. The difference between the prediction and the label value is the loss for that example. The system aggregates the losses for all the examples to compute the total loss for the current batch.
During the backward pass (backpropagation), the system reduces loss by adjusting the weights of all the neurons in all the hidden layer(s) .

In calculus terms, backpropagation implements the chain rule . from calculus. That is, backpropagation calculates the partial derivative of the error with respect to each parameter.

Years ago, ML practitioners had to write code to implement backpropagation. Modern ML APIs like Keras now implement backpropagation for you. اوه!

See Neural networks in Machine Learning Crash Course for more information.

دسته ای

#fundamentals

The set of examples used in one training iteration . The batch size determines the number of examples in a batch.

See epoch for an explanation of how a batch relates to an epoch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

اندازه دسته

#fundamentals

The number of examples in a batch . For instance, if the batch size is 100, then the model processes 100 examples per iteration .

The following are popular batch size strategies:

Stochastic Gradient Descent (SGD) , in which the batch size is 1.
Full batch, in which the batch size is the number of examples in the entire training set . For instance, if the training set contains a million examples, then the batch size would be a million examples. Full batch is usually an inefficient strategy.
mini-batch in which the batch size is usually between 10 and 1000. Mini-batch is usually the most efficient strategy.

برای اطلاعات بیشتر به ادامه مطلب مراجعه کنید:

Production ML systems: Static versus dynamic inference in Machine Learning Crash Course.
Deep Learning Tuning Playbook .

bias (ethics/fairness)

#responsible

#fundamentals

2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:

Not to be confused with the bias term in machine learning models or prediction bias .

See Fairness: Types of bias in Machine Learning Crash Course for more information.

bias (math) or bias term

#fundamentals

An intercept or offset from an origin. Bias is a parameter in machine learning models, which is symbolized by either of the following:

ب
w ₀

For example, bias is the b in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

In a simple two-dimensional line, bias just means "y-intercept." For example, the bias of the line in the following illustration is 2.

The plot of a line with a slope of 0.5 and a bias (y-intercept) of 2.

Bias is not to be confused with bias in ethics and fairness or prediction bias .

See Linear Regression in Machine Learning Crash Course for more information.

طبقه بندی باینری

#fundamentals

A type of classification task that predicts one of two mutually exclusive classes:

طبقه مثبت
the negative class

For example, the following two machine learning models each perform binary classification:

A model that determines whether email messages are spam (the positive class) or not spam (the negative class).
A model that evaluates medical symptoms to determine whether a person has a particular disease (the positive class) or doesn't have that disease (the negative class).

Contrast with multi-class classification .

See Classification in Machine Learning Crash Course for more information.

سطل سازی

#fundamentals

Converting a single feature into multiple binary features called buckets or bins , typically based on a value range. The chopped feature is typically a continuous feature .

For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete buckets, such as:

<= 10 degrees Celsius would be the "cold" bucket.
11 - 24 degrees Celsius would be the "temperate" bucket.
>= 25 degrees Celsius would be the "warm" bucket.

The model will treat every value in the same bucket identically. For example, the values 13 and 22 are both in the temperate bucket, so the model treats the two values identically.

Click the icon for additional notes.

How do you know how many buckets to create, or what the ranges for each bucket should be? The answers typically require a fair amount of experimentation.

See Numerical data: Binning in Machine Learning Crash Course for more information.

سی

داده های طبقه بندی شده

#fundamentals

Features having a specific set of possible values. For example, consider a categorical feature named traffic-light-state , which can only have one of the following three possible values:

red
yellow
green

By representing traffic-light-state as a categorical feature, a model can learn the differing impacts of red , green , and yellow on driver behavior.

Categorical features are sometimes called discrete features .

Contrast with numerical data .

See Working with categorical data in Machine Learning Crash Course for more information.

کلاس

#fundamentals

A category that a label can belong to. به عنوان مثال:

In a binary classification model that detects spam, the two classes might be spam and not spam .
In a multi-class classification model that identifies dog breeds, the classes might be poodle , beagle , pug , and so on.

A classification model predicts a class. In contrast, a regression model predicts a number rather than a class.

See Classification in Machine Learning Crash Course for more information.

مدل طبقه بندی

#fundamentals

A model whose prediction is a class . For example, the following are all classification models:

A model that predicts an input sentence's language (French? Spanish? Italian?).
A model that predicts tree species (Maple? Oak? Baobab?).
A model that predicts the positive or negative class for a particular medical condition.

In contrast, regression models predict numbers rather than classes.

Two common types of classification models are:

طبقه بندی باینری
multi-class classification

classification threshold

#fundamentals

A logistic regression model outputs a raw value between 0 and 1. Then:

If this raw value is greater than the classification threshold, then the positive class is predicted.
If this raw value is less than the classification threshold, then the negative class is predicted.

For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class.

The choice of classification threshold strongly influences the number of false positives and false negatives .

Click the icon for additional notes.

For example, consider a binary classification disease prediction model. Suppose that when the system runs in the first year:

The raw value for a particular patient is 0.95.
The classification threshold is 0.94.

Therefore, the system diagnoses the positive class. (The patient gasps, "Oh no! I'm sick!")

A year later, perhaps the values now look as follows:

The raw value for the same patient remains at 0.95.
The classification threshold changes to 0.97.

Therefore, the system now reclassifies that patient as the negative class. ("Happy day! I'm not sick.") Same patient. Different diagnosis.

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

طبقه بندی کننده

#fundamentals

A casual term for a classification model .

class-imbalanced dataset

#fundamentals

1,000,000 negative labels
10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is not class-imbalanced because the ratio of negative labels to positive labels is relatively close to 1:

517 negative labels
483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is also class-imbalanced because one label has far more examples than the other two:

1,000,000 labels with class "green"
200 labels with class "purple"
350 labels with class "orange"

See also entropy , majority class , and minority class .

بریدن

#fundamentals

A technique for handling outliers by doing either or both of the following:

Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
Increasing feature values that are less than a minimum threshold up to that minimum threshold.

For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:

Clip all values over 60 (the maximum threshold) to be exactly 60.
Clip all values under 40 (the minimum threshold) to be exactly 40.

Gradient clipping forces gradient values within a designated range during training.

See Numerical data: Normalization in Machine Learning Crash Course for more information.

ماتریس سردرگمی

#fundamentals

	Tumor (predicted)	Non-Tumor (predicted)
Tumor (ground truth)	18 (TP)	1 (FN)
Non-Tumor (ground truth)	6 (FP)	452 (TN)

The preceding confusion matrix shows the following:

Of the 19 predictions in which ground truth was Tumor, the model correctly classified 18 and incorrectly classified 1.
Of the 458 predictions in which ground truth was Non-Tumor, the model correctly classified 452 and incorrectly classified 6.

	Setosa (predicted)	Versicolor (predicted)	Virginica (predicted)
Setosa (ground truth)	88	12	0
Versicolor (ground truth)	6	141	7
Virginica (ground truth)	2	27	109

As yet another example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or mistakenly predict 1 instead of 7.

Confusion matrixes contain sufficient information to calculate a variety of performance metrics, including precision and recall .

continuous feature

#fundamentals

A floating-point feature with an infinite range of possible values, such as temperature or weight.

Contrast with discrete feature .

همگرایی

#fundamentals

A state reached when loss values change very little or not at all with each iteration . For example, the following loss curve suggests convergence at around 700 iterations:

A model converges when additional training won't improve the model.

D

DataFrame

#fundamentals

A popular pandas data type for representing datasets in memory.

A DataFrame is analogous to a table or a spreadsheet. Each column of a DataFrame has a name (a header), and each row is identified by a unique number.

Each column in a DataFrame is structured like a 2D array, except that each column can be assigned its own data type.

See also the official pandas.DataFrame reference page .

data set or dataset

#fundamentals

A collection of raw data, commonly (but not exclusively) organized in one of the following formats:

a spreadsheet
a file in CSV (comma-separated values) format

deep model

#fundamentals

A neural network containing more than one hidden layer .

A deep model is also called a deep neural network .

Contrast with wide model .

dense feature

#fundamentals

A feature in which most or all values are nonzero, typically a Tensor of floating-point values. For example, the following 10-element Tensor is dense because 9 of its values are nonzero:

Contrast with sparse feature .

عمق

#fundamentals

The sum of the following in a neural network :

the number of hidden layers
the number of output layers , which is typically 1
the number of any embedding layers

For example, a neural network with five hidden layers and one output layer has a depth of 6.

Notice that the input layer doesn't influence depth.

discrete feature

#fundamentals

A feature with a finite set of possible values. For example, a feature whose values may only be animal , vegetable , or mineral is a discrete (or categorical) feature.

Contrast with continuous feature .

پویا

#fundamentals

Something done frequently or continuously. The terms dynamic and online are synonyms in machine learning. The following are common uses of dynamic and online in machine learning:

A dynamic model (or online model ) is a model that is retrained frequently or continuously.
Dynamic training (or online training ) is the process of training frequently or continuously.
Dynamic inference (or online inference ) is the process of generating predictions on demand.

dynamic model

#fundamentals

Contrast with static model .

E

توقف زودهنگام

#fundamentals

Click the icon for additional notes.

لایه جاسازی

#language

#fundamentals

An array of 73,000 elements. The first 6,232 elements hold the value 0. The next element holds the value 1. The final 66,767 elements hold مقدار صفر

In certain situations, hashing is a reasonable alternative to an embedding layer.

See Embeddings in Machine Learning Crash Course for more information.

دوران

#fundamentals

A full training pass over the entire training set such that each example has been processed once.

An epoch represents N / batch size training iterations , where N is the total number of examples.

For instance, suppose the following:

The dataset consists of 1,000 examples.
The batch size is 50 examples.

Therefore, a single epoch requires 20 iterations:

1 epoch = (N/batch size) = (1,000 / 50) = 20 iterations

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

مثال

#fundamentals

The values of one row of features and possibly a label . Examples in supervised learning fall into two general categories:

A labeled example consists of one or more features and a label. Labeled examples are used during training.
An unlabeled example consists of one or more features but no label. Unlabeled examples are used during inference.

For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. Here are three labeled examples:

ویژگی ها			برچسب بزنید
دما	رطوبت	فشار	نمره آزمون
15	47	998	خوب
19	34	1020	عالی
18	92	1012	بیچاره

Here are three unlabeled examples:

دما	رطوبت	فشار
12	62	1014
21	47	1017
19	41	1021

See Supervised Learning in the Introduction to Machine Learning course for more information.

اف

منفی کاذب (FN)

#fundamentals

#Metric

مثبت کاذب (FP)

#fundamentals

#Metric

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals

#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve .

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

ویژگی

#fundamentals

ویژگی ها			برچسب بزنید
دما	رطوبت	فشار	نمره آزمون
15	47	998	92
19	34	1020	84
18	92	1012	87

Contrast with label .

See Supervised Learning in the Introduction to Machine Learning course for more information.

feature cross

#fundamentals

A synthetic feature formed by "crossing" categorical or bucketed features.

For example, consider a "mood forecasting" model that represents temperature in one of the following four buckets:

freezing
chilly
temperate
warm

And represents wind speed in one of the following three buckets:

still
light
windy

Alternatively, you could create a feature cross of temperature and wind speed. This synthetic feature would have the following 12 possible values:

freezing-still
freezing-light
freezing-windy
chilly-still
chilly-light
chilly-windy
temperate-still
temperate-light
temperate-windy
warm-still
warm-light
warm-windy

Thanks to feature crosses, the model can learn mood differences between a freezing-windy day and a freezing-still day.

Formally, a cross is a Cartesian product .

Feature crosses are mostly used with linear models and are rarely used with neural networks.

See Categorical data: Feature crosses in Machine Learning Crash Course for more information.

مهندسی ویژگی

#fundamentals

#TensorFlow

A process that involves the following steps:

Determining which features might be useful in training a model.
Converting raw data from the dataset into efficient versions of those features.

For example, you might determine that temperature might be a useful feature. Then, you might experiment with bucketing to optimize what the model can learn from different temperature ranges.

Feature engineering is sometimes called feature extraction or featurization .

Click the icon for additional notes about TensorFlow.

In TensorFlow, feature engineering often means converting raw log file entries to tf.Example protocol buffers. See also tf.Transform .

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

مجموعه ویژگی

#fundamentals

بردار ویژگی

#fundamentals

[0.92, 0.56]

Four layers: an input layer, two hidden layers, and one output layer.
The input layer contains two nodes, one containing the value
0.92 and the other containing the value 0.56.

Each example supplies different values for the feature vector, so the feature vector for the next example could be something like:

[0.73, 0.49]

[0.0, 0.0, 1.0, 0.0, 0.0]

As another example, suppose your model consists of three features:

a binary categorical feature with five possible values represented with one-hot encoding; for example: [0.0, 1.0, 0.0, 0.0, 0.0]
another binary categorical feature with three possible values represented with one-hot encoding; for example: [0.0, 0.0, 1.0]
a floating-point feature; for example: 8.3 .

In this case, the feature vector for each example would be represented by nine values. Given the example values in the preceding list, the feature vector would be:

0.0
1.0
0.0
0.0
0.0
0.0
0.0
1.0
8.3

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

حلقه بازخورد

#fundamentals

See Production ML systems: Questions to ask in Machine Learning Crash Course for more information.

جی

تعمیم

#fundamentals

A model's ability to make correct predictions on new, previously unseen data. A model that can generalize is the opposite of a model that is overfitting .

Click the icon for additional notes.

To encourage generalization, regularization helps a model train less exactly to the peculiarities of the data in the training set.

See Generalization in Machine Learning Crash Course for more information.

generalization curve

#fundamentals

A plot of both training loss and validation loss as a function of the number of iterations .

See Generalization in Machine Learning Crash Course for more information.

شیب نزول

#fundamentals

A mathematical technique to minimize loss . Gradient descent iteratively adjusts weights and biases , gradually finding the best combination to minimize loss.

Gradient descent is older—much, much older—than machine learning.

See the Linear regression: Gradient descent in Machine Learning Crash Course for more information.

حقیقت زمین

#fundamentals

واقعیت.

The thing that actually happened.

Click the icon for additional notes.

We assess model quality against ground truth. However, ground truth is not always completely, well, truthful. For example, consider the following examples of potential imperfections in ground truth:

In the graduation example, are we certain that the graduation records for each student are always correct? Is the university's record-keeping flawless?
Suppose the label is a floating-point value measured by instruments (for example, barometers). How can we be sure that each instrument is calibrated identically or that each reading was taken under the same circumstances?
If the label is a matter of human opinion, how can we be sure that each human rater is evaluating events in the same way? To improve consistency, expert human raters sometimes intervene.

اچ

لایه پنهان

#fundamentals

A deep neural network contains more than one hidden layer. For example, the preceding illustration is a deep neural network because the model contains two hidden layers.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course for more information.

هایپرپارامتر

#fundamentals

In contrast, parameters are the various weights and bias that the model learns during training.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

من

independently and identically distributed (iid)

#fundamentals

استنتاج

#fundamentals

In machine learning, the process of making predictions by applying a trained model to unlabeled examples .

Inference has a somewhat different meaning in statistics. See the Wikipedia article on statistical inference for details.

See Supervised Learning in the Intro to ML course to see inference's role in a supervised learning system.

لایه ورودی

#fundamentals

Four layers: an input layer, two hidden layers, and an output layer.

تفسیر پذیری

#fundamentals

The ability to explain or to present an ML model's reasoning in understandable terms to a human.

You can use the Learning Interpretability Tool (LIT) to interpret ML models.

تکرار

#fundamentals

When training a neural network , a single iteration involves the following two passes:

A forward pass to evaluate loss on a single batch.
A backward pass ( backpropagation ) to adjust the model's parameters based on the loss and the learning rate.

See Gradient descent in Machine Learning Crash Course for more information.

L

L ₀ regularization

#fundamentals

L ₀ regularization is sometimes called L0-norm regularization .

Click the icon for additional notes.

L ₀ regularization is generally impractical in large models because L ₀ regularization turns training into a convex optimization problem.

L ₁ loss

#fundamentals

#Metric

Actual value of example	Model's predicted value	Absolute value of delta
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L ₁ loss

L ₁ loss is less sensitive to outliers than L ₂ loss .

The Mean Absolute Error is the average L ₁ loss per example.

Click the icon to see the formal math.

$$ L_1 loss = \sum_{i=0}^n | y_i - \hat{y}_i |$$

کجا:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

See Linear regression: Loss in Machine Learning Crash Course for more information.

L ₁ regularization

#fundamentals

Contrast with L ₂ regularization .

L ₂ loss

#fundamentals

#Metric

Actual value of example	Model's predicted value	Square of delta
7	6	1
5	4	1
8	11	9
4	6	4
9	8	1
		16 = L ₂ loss

Regression models typically use L ₂ loss as the loss function.

The Mean Squared Error is the average L ₂ loss per example. Squared loss is another name for L ₂ loss.

Click the icon to see the formal math.

$$ L_2 loss = \sum_{i=0}^n {(y_i - \hat{y}_i)}^2$$

کجا:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

L ₂ regularization

#fundamentals

L ₂ regularization always improves generalization in linear models .

Contrast with L ₁ regularization .

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

برچسب

#fundamentals

In supervised machine learning , the "answer" or "result" portion of an example .

See Supervised Learning in Introduction to Machine Learning for more information.

labeled example

#fundamentals

An example that contains one or more features and a label . For example, the following table shows three labeled examples from a house valuation model, each with three features and one label:

تعداد اتاق خواب	Number of bathrooms	House age	House price (label)
3	2	15	345000 دلار
2	1	72	179000 دلار
4	2	34	392000 دلار

In supervised machine learning , models train on labeled examples and make predictions on unlabeled examples .

Contrast labeled example with unlabeled examples.

See Supervised Learning in Introduction to Machine Learning for more information.

لامبدا

#fundamentals

Synonym for regularization rate .

Lambda is an overloaded term. Here we're focusing on the term's definition within regularization .

لایه

#fundamentals

A set of neurons in a neural network . Three common types of layers are as follows:

The input layer , which provides values for all the features .
One or more hidden layers , which find nonlinear relationships between the features and the label.
The output layer , which provides the prediction.

For example, the following illustration shows a neural network with one input layer, two hidden layers, and one output layer:

In TensorFlow , layers are also Python functions that take Tensors and configuration options as input and produce other tensors as output.

میزان یادگیری

#fundamentals

Click the icon for a more mathematical explanation.

During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step .

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

خطی

#fundamentals

A relationship between two or more variables that can be represented solely through addition and multiplication.

The plot of a linear relationship is a line.

Contrast with nonlinear .

مدل خطی

#fundamentals

Linear models are usually easier to train and more interpretable than deep models. However, deep models can learn complex relationships between features.

Linear regression and logistic regression are two types of linear models.

Click the icon to see the math.

A linear model follows this formula:

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

کجا:

y' is the raw prediction. (In certain kinds of linear models, this raw prediction will be further modified. For example, see logistic regression .)
b is the bias .
w is a weight , so w ₁ is the weight of the first feature, w ₂ is the weight of the second feature, and so on.
x is a feature , so x ₁ is the value of the first feature, x ₂ is the value of the second feature, and so on.

For example, suppose a linear model for three features learns the following bias and weights:

b = 7
w ₁ = -2.5
w ₂ = -1.2
w ₃ = 1.4

Therefore, given three features (x ₁ , x ₂ , and x ₃ ), the linear model uses the following equation to generate each prediction:

y' = 7 + (-2.5)(x₁) + (-1.2)(x₂) + (1.4)(x₃)

Suppose a particular example contains the following values:

x ₁ = 4
x ₂ = -10
x ₃ = 5

Plugging those values into the formula yields a prediction for this example:

y' = 7 + (-2.5)(4) + (-1.2)(-10) + (1.4)(5)
y' = 16

رگرسیون خطی

#fundamentals

A type of machine learning model in which both of the following are true:

The model is a linear model .
The prediction is a floating-point value. (This is the regression part of linear regression .)

Contrast linear regression with logistic regression . Also, contrast regression with classification .

See Linear regression in Machine Learning Crash Course for more information.

رگرسیون لجستیک

#fundamentals

A type of regression model that predicts a probability. Logistic regression models have the following characteristics:

The label is categorical . The term logistic regression usually refers to binary logistic regression , that is, to a model that calculates probabilities for labels with two possible values. A less common variant, multinomial logistic regression , calculates probabilities for labels with more than two possible values.
The loss function during training is Log Loss . (Multiple Log Loss units can be placed in parallel for labels with more than two possible values.)
The model has a linear architecture, not a deep neural network. However, the remainder of this definition also applies to deep models that predict probabilities for categorical labels.

A 72% chance of the email being spam.
A 28% chance of the email not being spam.

A logistic regression model uses the following two-step architecture:

The model generates a raw prediction (y') by applying a linear function of input features.
The model uses that raw prediction as input to a sigmoid function , which converts the raw prediction to a value between 0 and 1, exclusive.

Like any regression model, a logistic regression model predicts a number. However, this number typically becomes part of a binary classification model as follows:

If the predicted number is greater than the classification threshold , the binary classification model predicts the positive class.
If the predicted number is less than the classification threshold, the binary classification model predicts the negative class.

See Logistic regression in Machine Learning Crash Course for more information.

از دست دادن گزارش

#fundamentals

The loss function used in binary logistic regression .

Click the icon to see the math.

The following formula calculates Log Loss:

$$\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')$$

کجا:

$(x,y)\in D$ is the dataset containing many labeled examples, which are $(x,y)$ جفت
$y$ is the label in a labeled example. Since this is logistic regression, every value of $y$ must either be 0 or 1.
$y'$ is the predicted value (somewhere between 0 and 1, exclusive), given the set of features in $x$.

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

log-odds

#fundamentals

The logarithm of the odds of some event.

Click the icon to see the math.

$$ {\text{odds}} = \frac{\text{p}} {\text{(1-p)}} = \frac{.9} {.1} = {\text{9}} $$

$$ {\text{log-odds}} = ln(9) ~= 2.2 $$

The log-odds function is the inverse of the sigmoid function .

از دست دادن

#fundamentals

#Metric

During the training of a supervised model , a measure of how far a model's prediction is from its label .

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss curve

#fundamentals

A plot of loss as a function of the number of training iterations . The following plot shows a typical loss curve:

A Cartesian graph of loss versus training iterations, showing a
rapid drop in loss for the initial iterations, followed by a gradual
drop, and then a flat slope during the final iterations.

Loss curves can help you determine when your model is converging or overfitting .

Loss curves can plot all of the following types of loss:

از دست دادن آموزش
validation loss
test loss

See also generalization curve .

See Overfitting: Interpreting loss curves in Machine Learning Crash Course for more information.

عملکرد از دست دادن

#fundamentals

#Metric

The goal of training is typically to minimize the loss that a loss function returns.

Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. به عنوان مثال:

L ₂ loss (or Mean Squared Error ) is the loss function for linear regression .
Log Loss is the loss function for logistic regression .

م

یادگیری ماشینی

#fundamentals

Machine learning also refers to the field of study concerned with these programs or systems.

See the Introduction to Machine Learning course for more information.

majority class

#fundamentals

The more common label in a class-imbalanced dataset . For example, given a dataset containing 99% negative labels and 1% positive labels, the negative labels are the majority class.

Contrast with minority class .

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

mini-batch

#fundamentals

A small, randomly selected subset of a batch processed in one iteration . The batch size of a mini-batch is usually between 10 and 1,000 examples.

It is much more efficient to calculate the loss on a mini-batch than the loss on all the examples in the full batch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

minority class

#fundamentals

The less common label in a class-imbalanced dataset . For example, given a dataset containing 99% negative labels and 1% positive labels, the positive labels are the minority class.

Contrast with majority class .

Click the icon for additional notes.

If your dataset doesn't contain enough minority class examples, consider using downsampling (the definition in the second bullet) to supplement the minority class.

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

مدل

#fundamentals

A linear regression model consists of a set of weights and a bias .
A neural network model consists of:
- A set of hidden layers , each containing one or more neurons .
- The weights and bias associated with each neuron.
A decision tree model consists of:
- The shape of the tree; that is, the pattern in which the conditions and leaves are connected.
- The conditions and leaves.

You can save, restore, or make copies of a model.

Unsupervised machine learning also generates models, typically a function that can map an input example to the most appropriate cluster .

Click the icon to compare algebraic and programming functions to ML models.

An algebraic function such as the following is a model:

  f(x, y) = 3x -5xy + y² + 17

The preceding function maps input values ( x and y ) to output.

Similarly, a programming function like the following is also a model:

def half_of_greater(x, y):
  if (x > y):
    return(x / 2)
  else
    return(y / 2)

A caller passes arguments to the preceding Python function, and the Python function generates output (via the return statement).

A human programmer codes a programming function manually. In contrast, a machine learning model gradually learns the optimal parameters during automated training.

multi-class classification

#fundamentals

زنبق ستوزا
زنبق ویرجینیکا
زنبق ورسیکالر

A model trained on the Iris dataset that predicts Iris type on new examples is performing multi-class classification.

In clustering problems, multi-class classification refers to more than two clusters.

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

ن

negative class

#fundamentals

#Metric

The negative class in a medical test might be "not tumor."
The negative class in an email classification model might be "not spam."

Contrast with positive class .

شبکه عصبی

#fundamentals

A neural network with an input layer, two hidden layers, and an لایه خروجی

Neural networks implemented on computers are sometimes called artificial neural networks to differentiate them from neural networks found in brains and other nervous systems.

Some neural networks can mimic extremely complex nonlinear relationships between different features and the label.

See Neural networks in Machine Learning Crash Course for more information.

نورون

#fundamentals

In machine learning, a distinct unit within a hidden layer of a neural network . Each neuron performs the following two-step action:

Calculates the weighted sum of input values multiplied by their corresponding weights.
Passes the weighted sum as input to an activation function .

The following illustration highlights two neurons and their inputs.

A neuron in a neural network mimics the behavior of neurons in brains and other parts of nervous systems.

node (neural network)

#fundamentals

A neuron in a hidden layer .

See Neural Networks in Machine Learning Crash Course for more information.

غیر خطی

#fundamentals

دو قطعه One plot is a line, so this is a linear relationship. The other plot is a curve, so this is a nonlinear relationship.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course to experiment with different kinds of nonlinear functions.

nonstationarity

#fundamentals

A feature whose values change across one or more dimensions, usually time. For example, consider the following examples of nonstationarity:

The number of swimsuits sold at a particular store varies with the season.
The quantity of a particular fruit harvested in a particular region is zero for much of the year but large for a brief period.
Due to climate change, annual mean temperatures are shifting.

Contrast with stationarity .

عادی سازی

#fundamentals

Broadly speaking, the process of converting a variable's actual range of values into a standard range of values, such as:

-1 to +1
0 به 1
Z-scores (roughly, -3 to +3)

See also Z-score normalization .

See Numerical Data: Normalization in Machine Learning Crash Course for more information.

داده های عددی

#fundamentals

Numerical features are sometimes called continuous features .

See Working with numerical data in Machine Learning Crash Course for more information.

O

آفلاین

#fundamentals

Synonym for static .

offline inference

#fundamentals

The process of a model generating a batch of predictions and then caching (saving) those predictions. Apps can then access the inferred prediction from the cache rather than rerunning the model.

Offline inference is also called static inference .

Contrast with online inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

one-hot encoding

#fundamentals

Representing categorical data as a vector in which:

One element is set to 1.
All other elements are set to 0.

"Denmark"
"سوئد"
"Norway"
"Finland"
"Iceland"

One-hot encoding could represent each of the five values as follows:

کشور	بردار
"Denmark"	1	0	0	0	0
"سوئد"	0	1	0	0	0
"Norway"	0	0	1	0	0
"Finland"	0	0	0	1	0
"Iceland"	0	0	0	0	1

Thanks to one-hot encoding, a model can learn different connections based on each of the five countries.

"Denmark" is 0
"Sweden" is 1
"Norway" is 2
"Finland" is 3
"Iceland" is 4

See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.

one-vs.-all

#fundamentals

animal versus not animal
vegetable versus not vegetable
mineral versus not mineral

آنلاین

#fundamentals

Synonym for dynamic .

online inference

#fundamentals

Contrast with offline inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

output layer

#fundamentals

The "final" layer of a neural network. The output layer contains the prediction.

The following illustration shows a small deep neural network with an input layer, two hidden layers, and an output layer:

بیش از حد

#fundamentals

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

Regularization can reduce overfitting. Training on a large and diverse training set can also reduce overfitting.

Click the icon for additional notes.

See Overfitting in Machine Learning Crash Course for more information.

پ

پانداها

#fundamentals

پارامتر

#fundamentals

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

In contrast, hyperparameters are the values that you (or a hyperparameter tuning service) supply to the model. For example, learning rate is a hyperparameter.

positive class

#fundamentals

#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class .

Click the icon for additional notes.

Admittedly, you're simultaneously testing for both the positive and negative classes.

پس پردازش

#responsible

#fundamentals

Adjusting the output of a model after the model has been run. Post-processing can be used to enforce fairness constraints without modifying models themselves.

پیش بینی

#fundamentals

A model's output. به عنوان مثال:

The prediction of a binary classification model is either the positive class or the negative class.
The prediction of a multi-class classification model is one class.
The prediction of a linear regression model is a number.

proxy labels

#fundamentals

Data used to approximate labels not directly available in a dataset.

See Datasets: Labels in Machine Learning Crash Course for more information.

آر

RAG

#fundamentals

Abbreviation for retrieval-augmented generation .

ارزیاب

#fundamentals

A human who provides labels for examples . "Annotator" is another name for rater.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

واحد خطی اصلاح شده (ReLU)

#fundamentals

An activation function with the following behavior:

If input is negative or zero, then the output is 0.
If input is positive, then the output is equal to the input.

به عنوان مثال:

If the input is -3, then the output is 0.
If the input is +3, then the output is 3.0.

Here is a plot of ReLU:

ReLU is a very popular activation function. Despite its simple behavior, ReLU still enables a neural network to learn nonlinear relationships between features and the label .

مدل رگرسیون

#fundamentals

Informally, a model that generates a numerical prediction. (In contrast, a classification model generates a class prediction.) For example, the following are all regression models:

A model that predicts a certain house's value in Euros, such as 423,000.
A model that predicts a certain tree's life expectancy in years, such as 23.2.
A model that predicts the amount of rain in inches that will fall in a certain city over the next six hours, such as 0.18.

Two common types of regression models are:

Linear regression , which finds the line that best fits label values to features.
Logistic regression , which generates a probability between 0.0 and 1.0 that a system typically then maps to a class prediction.

منظم سازی

#fundamentals

Any mechanism that reduces overfitting . Popular types of regularization include:

L ₁ regularization
L ₂ regularization
تنظیم ترک تحصیل
early stopping (this is not a formal regularization method, but can effectively limit overfitting)

Regularization can also be defined as the penalty on a model's complexity.

Click the icon for additional notes.

Regularization is counterintuitive. Increasing regularization usually increases training loss, which is confusing because, well, isn't the goal to minimize training loss?

See Overfitting: Model complexity in Machine Learning Crash Course for more information.

regularization rate

#fundamentals

Click the icon to see the math.

The regularization rate is usually represented as the Greek letter lambda. The following simplified loss equation shows lambda's influence:

$$\text{minimize(loss function + }\lambda\text{(regularization))}$$

where regularization is any regularization mechanism, including;

L ₁ regularization
L ₂ regularization

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

ReLU

#fundamentals

Abbreviation for Rectified Linear Unit .

retrieval-augmented generation (RAG)

#fundamentals

Common motivations to use retrieval-augmented generation include:

Increasing the factual accuracy of a model's generated responses.
Giving the model access to knowledge it was not trained on.
Changing the knowledge that the model uses.
Enabling the model to cite sources.

For example, suppose that a chemistry app uses the PaLM API to generate summaries related to user queries. When the app's backend receives a query, the backend:

Searches for ("retrieves") data that's relevant to the user's query.
Appends ("augments") the relevant chemistry data to the user's query.
Instructs the LLM to create a summary based on the appended data.

ROC (receiver operating characteristic) Curve

#fundamentals

#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

A number line with 8 positive examples on the right side and
7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative classes
completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0)
to (1.0,1.0).

An ROC curve. The x-axis is False Positive Rate and the y-axis
is True Positive Rate. The ROC curve approximates a shaky arc
traversing the compass points from West to North.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.

Root Mean Squared Error (RMSE)

#fundamentals

#Metric

The square root of the Mean Squared Error .

اس

sigmoid function

#fundamentals

The sigmoid function has several uses in machine learning, including:

Converting the raw output of a logistic regression or multinomial regression model to a probability.
Acting as an activation function in some neural networks.

Click the icon to see the math.

The sigmoid function over an input number x has the following formula:

$$ sigmoid(x) = \frac{1}{1 + e^{-\text{x}}} $$

In machine learning, x is generally a weighted sum .

سافت مکس

#fundamentals

Image is a...	احتمال
سگ	.85
گربه	.13
اسب	.02

Softmax is also called full softmax .

Contrast with candidate sampling .

Click the icon to see the math.

The softmax equation is as follows:

$$\sigma_i = \frac{e^{\text{z}_i}} {\sum_{j=1}^{j=K} {e^{\text{z}_j}}} $$

کجا:

$\sigma_i$ is the output vector. Each element of the output vector specifies the probability of this element. The sum of all the elements in the output vector is 1.0. The output vector contains the same number of elements as the input vector, $z$.
$z$ is the input vector. Each element of the input vector contains a floating-point value.
$K$ is the number of elements in the input vector (and the output vector).

For example, suppose the input vector is:

[1.2, 2.5, 1.8]

Therefore, softmax calculates the denominator as follows:

$$\text{denominator} = e^{1.2} + e^{2.5} + e^{1.8} = 21.552$$

The softmax probability of each element is therefore:

$$\sigma_1 = \frac{e^{1.2}}{21.552} = 0.154 $$$$\sigma_2 = \frac{e^{2.5}}{21.552} = 0.565 $$$$\sigma_1 = \frac{e^{1.8}}{21.552} = 0.281 $$

So, the output vector is therefore:

$$\sigma = [0.154, 0.565, 0.281]$$

The sum of the three elements in $\sigma$ is 1.0. اوه!

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

sparse feature

#language

#fundamentals

sparse representation

#language

#fundamentals

Storing only the position(s) of nonzero elements in a sparse feature.

For example, suppose a categorical feature named species identifies the 36 tree species in a particular forest. Further assume that each example identifies only a single species.

A vector in which positions 0 through 23 hold the value 0, position
24 holds the value 1, and positions 25 through 35 hold the value 0.

Alternatively, sparse representation would simply identify the position of the particular species. If maple is at position 24, then the sparse representation of maple would simply be:

Notice that the sparse representation is much more compact than the one-hot representation.

Click the icon for a slightly more complex example.

جمله زیر را در نظر بگیرید:

My dog is a great dog

A sparse representation of the same sentence would simply be:

Click the icon if you are confused.

See Working with categorical data in Machine Learning Crash Course for more information.

sparse vector

#fundamentals

A vector whose values are mostly zeroes. See also sparse feature and sparsity .

squared loss

#fundamentals

#Metric

Synonym for L ₂ loss .

ایستا

#fundamentals

Something done once rather than continuously. The terms static and offline are synonyms. The following are common uses of static and offline in machine learning:

static model (or offline model ) is a model trained once and then used for a while.
static training (or offline training ) is the process of training a static model.
static inference (or offline inference ) is a process in which a model generates a batch of predictions at a time.

Contrast with dynamic .

static inference

#fundamentals

Synonym for offline inference .

stationarity

#fundamentals

A feature whose values don't change across one or more dimensions, usually time. For example, a feature whose values look about the same in 2021 and 2023 exhibits stationarity.

In the real world, very few features exhibit stationarity. Even features synonymous with stability (like sea level) change over time.

Contrast with nonstationarity .

stochastic gradient descent (SGD)

#fundamentals

A gradient descent algorithm in which the batch size is one. In other words, SGD trains on a single example chosen uniformly at random from a training set .

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

یادگیری ماشینی تحت نظارت

#fundamentals

Compare with unsupervised machine learning .

See Supervised Learning in the Introduction to ML course for more information.

synthetic feature

#fundamentals

A feature not present among the input features, but assembled from one or more of them. Methods for creating synthetic features include the following:

Bucketing a continuous feature into range bins.
Creating a feature cross .
Multiplying (or dividing) one feature value by other feature value(s) or by itself. For example, if a and b are input features, then the following are examples of synthetic features:
- ab
- یک ²
Applying a transcendental function to a feature value. For example, if c is an input feature, then the following are examples of synthetic features:
- sin(c)
- ln(c)

Features created by normalizing or scaling alone are not considered synthetic features.

تی

test loss

#fundamentals

#Metric

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate .

آموزش

#fundamentals

See Supervised Learning in the Introduction to ML course for more information.

از دست دادن آموزش

#fundamentals

#Metric

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

A downward slope implies that the model is improving.
An upward slope implies that the model is getting worse.
A flat slope implies that the model has reached convergence .

For example, the following somewhat idealized loss curve shows:

A steep downward slope during the initial iterations, which implies rapid model improvement.
A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace then during the initial iterations.
A flat slope towards the end of training, which suggests convergence.

The plot of training loss versus iterations. This loss curve starts
with a steep downward slope. The slope gradually flattens until the
slope becomes zero.

Although training loss is important, see also generalization .

training-serving skew

#fundamentals

The difference between a model's performance during training and that same model's performance during serving .

مجموعه آموزشی

#fundamentals

The subset of the dataset used to train a model .

Traditionally, examples in the dataset are divided into the following three distinct subsets:

یک مجموعه آموزشی
a validation set
a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

منفی واقعی (TN)

#fundamentals

#Metric

An example in which the model correctly predicts the negative class . For example, the model infers that a particular email message is not spam , and that email message really is not spam .

مثبت واقعی (TP)

#fundamentals

#Metric

An example in which the model correctly predicts the positive class . For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals

#Metric

Synonym for recall . یعنی:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve .

U

underfitting

#fundamentals

Producing a model with poor predictive ability because the model hasn't fully captured the complexity of the training data. Many problems can cause underfitting, including:

Training on the wrong set of features .
Training for too few epochs or at too low a learning rate .
Training with too high a regularization rate .
Providing too few hidden layers in a deep neural network.

See Overfitting in Machine Learning Crash Course for more information.

unlabeled example

#fundamentals

An example that contains features but no label . For example, the following table shows three unlabeled examples from a house valuation model, each with three features but no house value:

تعداد اتاق خواب	Number of bathrooms	House age
3	2	15
2	1	72
4	2	34

In supervised machine learning , models train on labeled examples and make predictions on unlabeled examples .

In semi-supervised and unsupervised learning, unlabeled examples are used during training.

Contrast unlabeled example with labeled example .

یادگیری ماشینی بدون نظارت

#clustering

#fundamentals

Training a model to find patterns in a dataset, typically an unlabeled dataset.

Contrast with supervised machine learning .

Click the icon for additional notes.

See What is Machine Learning? in the Introduction to ML course for more information.

V

اعتبار سنجی

#fundamentals

The initial evaluation of a model's quality. Validation checks the quality of a model's predictions against the validation set .

Because the validation set differs from the training set , validation helps guard against overfitting .

You might think of evaluating the model against the validation set as the first round of testing and evaluating the model against the test set as the second round of testing.

validation loss

#fundamentals

#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

See also generalization curve .

مجموعه اعتبار سنجی

#fundamentals

Traditionally, you divide the examples in the dataset into the following three distinct subsets:

یک مجموعه آموزشی
a validation set
a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

دبلیو

وزن

#fundamentals

A value that a model multiplies by another value. Training is the process of determining a model's ideal weights; inference is the process of using those learned weights to make predictions.

Click the icon to see an example of weights in a linear model.

Imagine a linear model with two features. Suppose that training determines the following weights (and bias ):

The bias, b, has a value of 2.2
The weight, w ₁ associated with one feature is 1.5.
The weight, w ₂ associated with the other feature is 0.4.

Now imagine an example with the following feature values:

The value of one feature, x ₁ , is 6.
The value of the other feature, x ₂ , is 10.

This linear model uses the following formula to generate a prediction, y':

$$y' = b + w_1x_1 + w_2x_2$$

Therefore, the prediction is:

$$y' = 2.2 + (1.5)(6) + (0.4)(10) = 15.2$$

If a weight is 0, then the corresponding feature doesn't contribute to the model. For example, if w ₁ is 0, then the value of x ₁ is irrelevant.

See Linear regression in Machine Learning Crash Course for more information.

weighted sum

#fundamentals

The sum of all the relevant input values multiplied by their corresponding weights. For example, suppose the relevant inputs consist of the following:

مقدار ورودی	input weight
2	-1.3
-1	0.6
3	0.4

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

A weighted sum is the input argument to an activation function .

ز

عادی سازی امتیاز Z

#fundamentals

ارزش خام	امتیاز Z
800	0
950	+1.5
575	-2.25

The machine learning model then trains on the Z-scores for that feature instead of on the raw values.

See Numerical data: Normalization in Machine Learning Crash Course for more information.

واژه نامه یادگیری ماشینی: مبانی ML با مجموعه‌ها، منظم بمانید ذخیره و طبقه‌بندی محتوا براساس اولویت‌های شما.

الف

دقت

برای جزئیات در مورد دقت و مجموعه داده های نامتعادل کلاس، روی نماد کلیک کنید.

عملکرد فعال سازی

برای مشاهده نمونه روی نماد کلیک کنید.

هوش مصنوعی

AUC (مساحت زیر منحنی ROC)

برای اطلاع از رابطه بین منحنی های AUC و ROC روی نماد کلیک کنید.

برای تعریف رسمی تر AUC روی نماد کلیک کنید.

ب

پس انتشار

دسته ای

اندازه دسته

تعصب (اخلاق / انصاف)

تعصب (ریاضی) یا اصطلاح سوگیری

طبقه بندی باینری

سطل سازی

برای یادداشت های اضافی روی نماد کلیک کنید.

سی

داده های طبقه بندی شده

کلاس

مدل طبقه بندی

آستانه طبقه بندی

برای یادداشت های اضافی روی نماد کلیک کنید.

طبقه بندی کننده

مجموعه داده های کلاس نامتعادل

بریدن

ماتریس سردرگمی

ویژگی پیوسته

همگرایی

D

DataFrame

مجموعه داده یا مجموعه داده

مدل عمیق

ویژگی متراکم

عمق

ویژگی گسسته

پویا

مدل پویا

E

توقف زودهنگام

برای یادداشت های اضافی روی نماد کلیک کنید.

لایه جاسازی

دوران

مثال

اف

منفی کاذب (FN)

مثبت کاذب (FP)

نرخ مثبت کاذب (FPR)

ویژگی

متقاطع ویژگی

مهندسی ویژگی

برای یادداشت های اضافی در مورد TensorFlow روی نماد کلیک کنید.

مجموعه ویژگی

بردار ویژگی

حلقه بازخورد

جی

تعمیم

برای یادداشت های اضافی روی نماد کلیک کنید.

منحنی تعمیم

شیب نزول

حقیقت زمین

برای یادداشت های اضافی روی نماد کلیک کنید.

اچ

لایه پنهان

هایپرپارامتر

من

به طور مستقل و یکسان توزیع شده (IID)

استنتاج

لایه ورودی

تفسیر پذیری

تکرار

L

تنظیم منظم

برای یادداشت های اضافی روی نماد کلیک کنید.

L 1 ضرر

برای دیدن ریاضی رسمی ، روی نماد کلیک کنید.

l 1 منظم سازی

L 2 ضرر

واژه نامه یادگیری ماشینی: مبانی ML
با مجموعه‌ها، منظم بمانید ذخیره و طبقه‌بندی محتوا براساس اولویت‌های شما.

تنظیم _منظم

L ₁ ضرر

l ₁ منظم سازی

L ₂ ضرر

تنظیم منظم L ₂