
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING METHODS IN PHYSICS


Lectures for the PhD in Physics - XXXIV cycle
Rome, February-May 2019
S. Giagu

Lesson 6 - 4.4.2019
LECTURE SCHEDULE
• all in Aula Rasetti, starting at 17:00

✓ Thursday, February 28
✓ Thursday, March 7
✓ Wednesday, March 13 ≪ exception
✓ Thursday, March 21
✓ Thursday, March 28
✓ Thursday, April 4
• Thursday, April 11
• Wednesday, April 17 ≪ exception
• Thursday, May 2
• Thursday, May 9

2
OUTLINE LESSON #6
• L5: Artificial Neural Networks 101 - part [2]
• back-propagation algorithm description in more detail
• regularisation of a neural network
• practical methodology via a complete example: MNIST dataset classification
with NN implemented in keras/tensorflow with hyper-parameter tuning

3
BACK-PROPAGATION
• the training of a NN happens in two phases:
• Forward phase: the weights are fixed and the input vector is propagated layer
by layer up to the output neurons (function signal)
• Backward phase: the error Δ obtained by comparing output with target is
propagated backward, again layer by layer (error signal)

• every neuron (hidden or output) receives and compares the function and error signals
• back-propagation is essentially a simplification of gradient descent, obtained by recursively applying the chain rule of derivatives

4
BACK-PROPAGATION: OUTPUT NEURONS
• if j is the j-th output neuron, then the quantities we need to calculate for the n-th event are:

(figure: the j-th output neuron, receiving the inputs y_1(n), …, y_m(n) through the weights w_{ji}(n) and applying the activation function φ to its induced local field)

$$\xi_j(n) = \sum_{i=0}^{m} w_{ji}(n)\, y_i(n) \qquad \text{induced local field of the neuron}$$

$$y_j(n) = \varphi_j(\xi_j(n)) \qquad \varphi: \text{activation function}$$

$$e_j(n) = y_{\mathrm{true}}(n) - y_j(n) \qquad \text{deviation}$$

we want to update the weights in a similar way as in the SSE (sum of squared errors) case, with gradient descent:

$$\mathcal{E}_j(n) = \frac{1}{2}\, e_j^2(n) \quad \text{(instantaneous error)} \qquad\qquad \Delta w_{ji} = -\eta\, \frac{\partial \mathcal{E}_j(n)}{\partial w_{ji}(n)}$$

5
• let’s use the chain rule of the differential calculus:

$$\frac{\partial \mathcal{E}_j(n)}{\partial w_{ji}(n)}
  = \frac{\partial \mathcal{E}_j(n)}{\partial e_j(n)}\,
    \frac{\partial e_j(n)}{\partial y_j(n)}\,
    \frac{\partial y_j(n)}{\partial \xi_j(n)}\,
    \frac{\partial \xi_j(n)}{\partial w_{ji}(n)}
  = e_j(n) \cdot (-1) \cdot \varphi_j'(\xi_j(n)) \cdot y_i(n)
  = -e_j(n)\, \varphi_j'(\xi_j(n))\, y_i(n)$$

• so that, defining the local gradient:

$$\delta_j(n) = -\frac{\partial \mathcal{E}_j(n)}{\partial \xi_j(n)}
  = -\frac{\partial \mathcal{E}_j(n)}{\partial e_j(n)}\,
     \frac{\partial e_j(n)}{\partial y_j(n)}\,
     \frac{\partial y_j(n)}{\partial \xi_j(n)}
  = e_j(n)\, \varphi_j'(\xi_j(n))$$

the weight update becomes:

$$\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n)$$

these are all available quantities …

6
BACK-PROPAGATION: HIDDEN NEURONS
• for a hidden neuron the situation is more complicated, as we do not have direct access to y_true and therefore we are not able to compute e_j(n)
• to cope with that, e_j(n) is computed recursively using the error signals of all the forward neurons to which the hidden neuron is connected
• example: the j-th hidden neuron connected to the k-th output neuron:
(figure: hidden neuron j, with input weights w_{ji}(n), connected through the weight w_{kj}(n) to the k-th output neuron, for which)

$$e_k(n) = y_{\mathrm{true},k}(n) - y_k(n)$$

7
• The local gradient for the hidden neuron j is redefined as:

$$\delta_j(n) = -\frac{\partial \mathcal{E}(n)}{\partial \xi_j(n)}
  = -\frac{\partial \mathcal{E}(n)}{\partial y_j(n)}\,
     \frac{\partial y_j(n)}{\partial \xi_j(n)}
  = -\frac{\partial \mathcal{E}(n)}{\partial y_j(n)}\, \varphi_j'(\xi_j(n))$$

with:

$$\mathcal{E}(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n)$$

the total instantaneous error associated to the output neurons (it represents the error of the entire network)


• differentiating with respect to y_j(n):

$$\frac{\partial \mathcal{E}(n)}{\partial y_j(n)}
  = \sum_k e_k(n)\, \frac{\partial e_k(n)}{\partial y_j(n)}
  = \sum_k e_k(n)\, \frac{\partial e_k(n)}{\partial \xi_k(n)}\,
    \frac{\partial \xi_k(n)}{\partial y_j(n)}$$
8
• using:
$$e_k(n) = y_{\mathrm{true},k}(n) - y_k(n) = y_{\mathrm{true},k}(n) - \varphi_k(\xi_k(n))
  \;\Longrightarrow\;
  \frac{\partial e_k(n)}{\partial \xi_k(n)} = -\varphi_k'(\xi_k(n))$$
• also:

$$\xi_k(n) = \sum_{j=0}^{m} w_{kj}(n)\, y_j(n)
  \;\Longrightarrow\;
  \frac{\partial \xi_k(n)}{\partial y_j(n)} = w_{kj}(n)$$
• then:

$$\frac{\partial \mathcal{E}(n)}{\partial y_j(n)}
  = -\sum_k e_k(n)\, \varphi_k'(\xi_k(n))\, w_{kj}(n)
  = -\sum_k \delta_k(n)\, w_{kj}(n)$$
9
• obtaining for the back-propagation of the local gradient of the j-th hidden
neuron:

$$\delta_j(n) = \varphi_j'(\xi_j(n)) \sum_k \delta_k(n)\, w_{kj}(n)$$

where the sum runs over all forward neurons to which j is connected

• if the k neurons are output neurons: $\delta_k(n) = e_k(n)\, \varphi_k'(\xi_k(n))$

• if the k neurons are hidden neurons: $\delta_k(n) = \varphi_k'(\xi_k(n)) \sum_l \delta_l(n)\, w_{lk}(n)$
10
UPDATING WEIGHTS PROCEDURE

• step 1: take a batch of training data


• step 2: perform forward propagation at fixed weights to compute the loss
• step 3: back-propagate the loss to compute the gradient of the loss wrt each weight
• step 4: use the gradients to update the weights of the network
$$\langle \mathcal{E}(n) \rangle = \frac{1}{N} \sum_{n=1}^{N} \mathcal{E}(n)
  = \frac{1}{2N} \sum_{n=1}^{N} \sum_{k \in C} e_k^2(n)
  \qquad \text{batch \& mini-batch training}$$

$$\mathcal{E}(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n)
  \qquad \text{online training (weights updated after each event)}$$
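Putting the four steps together, here is a minimal NumPy sketch (not taken from the lecture notebook) of one forward/backward pass and weight update for a network with a single hidden layer, with sigmoid activations and the SSE loss, so that the δ recursions derived above appear explicitly:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, y_true, W1, W2, eta=0.1):
    """One forward/backward pass for a single-hidden-layer network with
    sigmoid activations and SSE loss (biases omitted for brevity).
    x: (batch, n_in), y_true: (batch, n_out), W1: (n_in, n_hid), W2: (n_hid, n_out)."""
    # forward phase: weights fixed, propagate the function signal
    xi_hid = x @ W1                         # induced local fields, hidden layer
    y_hid = sigmoid(xi_hid)
    xi_out = y_hid @ W2                     # induced local fields, output layer
    y_out = sigmoid(xi_out)

    # backward phase: propagate the error signal
    e = y_true - y_out                                      # e_k(n)
    delta_out = e * y_out * (1.0 - y_out)                   # delta_k = e_k * phi'(xi_k)
    delta_hid = (delta_out @ W2.T) * y_hid * (1.0 - y_hid)  # delta_j = phi'(xi_j) * sum_k delta_k w_kj

    # weight update Delta w = eta * delta * y, averaged over the (mini-)batch
    W2 += eta * y_hid.T @ delta_out / len(x)
    W1 += eta * x.T @ delta_hid / len(x)
    return 0.5 * np.sum(e ** 2) / len(x)    # mean instantaneous error

Calling backprop_step repeatedly on successive mini-batches implements steps 1-4; with a batch of a single event it reduces to the online training mode above.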

11
TRAINING ONLINE VS BATCH

12
CRITICAL ISSUES IN A NEURAL NETWORK
• training speed:
• several techniques to speed up convergence (momentum, adaptive learning rates, ad-hoc activation functions, …)
ADAPTIVE LEARNING RATE

SGD: stochastic gradient descent

AdaGrad: adapts the learning rate individually for each parameter:
‣ parameters associated to frequently occurring features: reduce η
‣ parameters associated to infrequently occurring features: increase η

RMSProp/Adadelta: variations of AdaGrad that mitigate its too aggressive behaviour in monotonically reducing the learning rate, by restricting the window of accumulated past gradients

Adam: similar to RMSProp + momentum (the standard update rules are sketched at the end of this slide)

• but most of all: dedicated processors (GPUs, FPGAs, TPUs (ASICs), …)
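For reference, a sketch of the standard update rules behind these adaptive schemes (not shown on the slide): $g_t$ is the gradient at step $t$, products and divisions are element-wise, $\epsilon$ is a small constant.

AdaGrad (per-parameter learning rate, shrinking with the accumulated squared gradients):
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t,
  \qquad G_t = \sum_{\tau \le t} g_\tau \odot g_\tau$$

RMSProp/Adadelta (exponential moving average instead of the full sum):
$$E[g^2]_t = \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t \odot g_t,
  \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t$$

Adam (RMSProp-like second moment plus a momentum-like first moment, both bias-corrected, $\hat m_t = m_t/(1-\beta_1^t)$, $\hat v_t = v_t/(1-\beta_2^t)$):
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \quad
  v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t \odot g_t, \quad
  \theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$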
13
CRITICAL ISSUES IN A NEURAL NETWORK
• hardcore overfitting: it immediately appears as soon as the network layout grows and the number of weights increases
• related to the variance-bias tradeoff (generalisation)
• no free lunch theorem: averaged over all possible data-generating distributions,
every classification algorithm has the same error rate when classifying previously
unobserved points → the most beautiful algorithm we can conceive of has the
same average performance (over all possible tasks) as a random classifier

• solution:
• apply a set of methods aligned with the specific task that we ask the algorithm to solve, in order to make it perform better
• regularisation techniques: most used to control overfitting in NNs
14
REGULARIZATION PROCEDURE IN A NN
• aims at preventing the model from overfitting the data, and thus deals with high-variance issues

• several strategies available, often used in conjunction, designed to reduce the test error,
possibly at the expense of increased training error
• L1/L2/L1+L2 regularisation: impose restrictions on the parameter values
• Dropout: impose restrictions on the complexity of the model (expressive power of the NN)
• Early stopping: impose restrictions on the reduction of the training error
• Mini-batch training: impose additional statistical fluctuations on loss function landscape
• Noise injection: impose additional statistical fluctuations on the features
• Data augmentation: better generalisation by training on larger data-sets
• …

15
DROPOUT
• very popular and powerful technique used in neural networks to prevent
overfitting the training data by dropping out neurons with a selected probability
• it forces the model to avoid relying too much on particular sets of features

(figure: the network before and after applying dropout)
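What dropout does at training time can be sketched in a few lines of NumPy (inverted dropout, which rescales the surviving activations so that nothing needs to change at test time):

import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    """Inverted dropout: zero each activation with probability p_drop
    and rescale the survivors by 1/(1 - p_drop)."""
    if not training or p_drop == 0.0:
        return activations                          # at test time the layer is a no-op
    keep = np.random.rand(*activations.shape) >= p_drop
    return activations * keep / (1.0 - p_drop)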

used routinely in the context of ConvNets, where it has been shown to drastically improve test-set performance

16
EARLY STOPPING AND NOISE INJECTION
• early stopping: a regularization technique in which the training process is stopped as soon as the validation loss reaches a plateau or starts to increase (a Keras callback sketch is given at the end of this slide)

(figure: structural risk minimization, R(f) vs R_s(f); adapted from F. Tortorella, Teoria e Tecniche di Pattern Recognition, Support Vector Machines, Università degli Studi di Cassino, © 2005)

• noise addition / information loss: makes it harder to rely on specific features of the training set
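In Keras, early stopping is available as a ready-made callback; a minimal sketch (the monitored quantity and the patience value are arbitrary choices here, and `model`, `x_train`, `y_train` are assumed to exist):

from tensorflow.keras.callbacks import EarlyStopping

# stop as soon as the validation loss has not improved for `patience` epochs,
# and roll back to the best weights seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])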

17
DATA AUGMENTATION
• the best way to make a machine learning model generalize better is to train it on more data
• getting more data is usually the real issue → solution: artificially increase the size of the training set

+ novel approaches (generative-NN: GAN, VAE, …)
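A minimal sketch of on-the-fly augmentation with Keras' ImageDataGenerator (the transformation ranges are arbitrary examples; `x_train`, `y_train` and `model` are assumed to exist, with `x_train` image-shaped, e.g. (N, 28, 28, 1)):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# random shifts/rotations/zooms generate new, plausible training images on the fly
augmenter = ImageDataGenerator(rotation_range=10,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               zoom_range=0.1)

# model.fit_generator(augmenter.flow(x_train, y_train, batch_size=128), epochs=10)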


18
L1/L2/L1+L2 REGULARISATION
• idea: constrain the size of the model by penalizing high weights unless strongly requested by the data itself
• L(w) → L(w) + penalty(w)

(figure: constraint regions in the (θ1, θ2) plane for the L1, L2 (weight decay), and L1+L2 penalties)
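In formulas (standard definitions, with λ the regularisation strength):

$$L_2\ \text{(weight decay)}:\ \mathrm{penalty}(w) = \lambda \sum_i w_i^2 \qquad
  L_1:\ \mathrm{penalty}(w) = \lambda \sum_i |w_i| \qquad
  L_1{+}L_2:\ \mathrm{penalty}(w) = \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2$$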

19
EXAMPLE: L2 REG. FOR LINEAR REGRESSION

(equation figure: the L2-regularised linear-regression loss, with the X⊤X term ∝ to the covariance matrix of the inputs)

L2 regularisation replaces X⊤X with X⊤X+αI → the learning algorithm sees the input X as having higher variance → it shrinks the weights on features whose covariance with the output target is low compared to this added variance
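Explicitly, this is the standard ridge-regression result (not written out on the slide): the ordinary least-squares solution becomes

$$w_{\mathrm{OLS}} = (X^{\top}X)^{-1} X^{\top} y
  \;\;\longrightarrow\;\;
  w_{\mathrm{L2}} = (X^{\top}X + \alpha I)^{-1} X^{\top} y$$

so that a larger α pushes the weights towards zero.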


20
MNIST DATASET CLASSIFICATION WITH DENSE NN IMPLEMENTED IN KERAS/TENSORFLOW
WITH HYPER-PARAMETER TUNING
https://www.dropbox.com/s/vhc02rd8h5lpj3l/FFNN_3_layers.ipynb?dl=0
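The actual notebook is linked above; the following is only a minimal sketch of the kind of dense model it builds (layer sizes, dropout rate and training hyper-parameters are placeholder choices, not necessarily those of the notebook):

import tensorflow as tf
from tensorflow.keras import layers, models

# load and flatten MNIST (28x28 grey-scale digits, 10 classes)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0

# small fully connected (dense) network
model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(784,)),
    layers.Dropout(0.2),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, validation_split=0.1, epochs=10, batch_size=128)
print(model.evaluate(x_test, y_test))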

21
Play with:

• # hidden layers, # neurons per layer → increase the expressive power / capacity
• dropout → keep the increased capacity under control (regularisation)
• learning rate, activation functions, optimizer, # epochs, batch size, weights initialisation → optimise training speed and convergence
22
ML HYPERPARAMETER TUNING RECIPE
based on analysis of bias/variance tradeoff in the error rate

23
# OF LAYERS AND ARCHITECTURE OF THE NETWORK

NOTE: multilayered dense NNs may have some advantages wrt shallow networks, as there are examples of functions that small deep NNs are able to approximate more efficiently than shallow NNs with many neurons.
Caveat: with dense nets it is very hard to go above 2-3 hidden layers, due to the dilution of the training power (this will be discussed in the next lesson, when we treat deep learning (≠ deep networks)).

Overfitting a small batch: when defining/debugging a model, it is useful, in order to make sure that the model can be properly trained, to pass a mini-batch through the network and see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.
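A sketch of this sanity check with Keras, assuming a compiled `model` and the MNIST arrays from the earlier sketch:

# take one small batch and check that the model can drive its training loss ~to zero
x_small, y_small = x_train[:64], y_train[:64]
history = model.fit(x_small, y_small, epochs=200, batch_size=64, verbose=0)
print('final loss on the small batch:', history.history['loss'][-1])
# if the loss does not go to ~0, the model/optimizer setup needs debugging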

24
ACTIVATION FUNCTION
• in principle any differentiable function may work
• however: there are functions that work better than others, in particular for deep architectures
• useful properties for shallow networks:
  • non-linear: to be able to approximate non-linear boundaries
  • saturating (i.e. with limited min and max values): so that weight values remain constrained and convergence time improves
  • monotone: to avoid spurious local minima
  • linear for small input values, to support linear behaviour when needed
  • anti-symmetric: for faster learning

(figures: typical activation functions for shallow NNs and for deep NNs)

25
# OF HIDDEN NEURONS AND WEIGHT INITIALISATION

• # input nodes: dimension of the input feature vector
• # output nodes: # of classes, or dimension of the output function in multivariate regression
• # hidden neurons?
  • determines the expressive power of the network
  • the choice of the number of hidden neurons is not a problem formally solved in information theory
  • empirically optimised on the validation sample + practical rules based on experience
  • example for shallow networks: with N events in the training set, # hidden neurons ~ N/10

• the initial value of the weights should never be set to zero! The symmetry of the network must be broken …
• Typical rules:
  • usually: random weights different from zero
  • Xavier initialization: provides initial weights that take into account characteristics unique to the architecture: weights are initialised so that the variance of the induced local field ξ of each neuron falls in the transition region between the linear and the saturation regions of the activation function (see the sketch at the end of this slide)
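A sketch of both options: plain small random weights, and the Glorot/Xavier initializer provided by Keras (which is in fact the default for Dense layers):

import numpy as np
from tensorflow.keras import layers

# naive option: small random weights, never all zeros (the symmetry must be broken)
W = np.random.normal(loc=0.0, scale=0.01, size=(784, 128))

# Xavier/Glorot initialisation: weight variance scaled with fan-in/fan-out so that
# the induced local field xi starts in the transition region of the activation
dense = layers.Dense(128, activation='tanh', kernel_initializer='glorot_uniform')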

26
NN MODEL DEFINITION
ReLU to improve training speed and convergence in deep NNs

dropout to avoid overfitting

alternatively:
- L1/L2/L1+L2 regularisation (a sketch follows below)
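The actual model definition is in the notebook; as a sketch of the "alternatively" note above, an L1/L2/L1+L2 penalty can be attached directly to a Dense layer instead of (or together with) dropout (the penalty strengths below are arbitrary placeholder values):

from tensorflow.keras import layers, regularizers

# dropout-based regularisation
hidden = layers.Dense(128, activation='relu')
drop = layers.Dropout(0.2)

# alternative: weight-decay style regularisation attached to the layer itself
hidden_l2 = layers.Dense(128, activation='relu',
                         kernel_regularizer=regularizers.l2(1e-4))
hidden_l1l2 = layers.Dense(128, activation='relu',
                           kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4))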

27
SGD: i.e. GD with mini-batch
RMSprop: GD with adaptive LR
Adam: combo SGD +
momentum + RMSprop
Adam typically performs better
out of the box …
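A sketch of how these three choices map onto Keras (the learning rates are just typical default values; `model` is assumed to be defined as before):

from tensorflow.keras import optimizers

# first argument is the learning rate in all three cases
sgd = optimizers.SGD(0.01, momentum=0.9)   # mini-batch gradient descent (+ momentum)
rmsprop = optimizers.RMSprop(0.001)        # adaptive, per-parameter learning rate
adam = optimizers.Adam(0.001)              # RMSprop-like scaling + momentum terms

# model.compile(optimizer=adam, loss='sparse_categorical_crossentropy', metrics=['accuracy'])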

28
~800K parameters … easy to reach O(10 M)

29
….

30
callback to implement decaying LR
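A sketch of such a callback with Keras; the decay law below (halving an initial 1e-3 every 10 epochs) is an arbitrary example:

import math
from tensorflow.keras.callbacks import LearningRateScheduler

def step_decay(epoch, lr=None):
    """Halve an initial learning rate of 1e-3 every 10 epochs."""
    return 1e-3 * math.pow(0.5, epoch // 10)

lr_callback = LearningRateScheduler(step_decay, verbose=1)
# model.fit(x_train, y_train, epochs=50, callbacks=[lr_callback])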

31
….

32
….

33
Working example: tuning of the network

from 91% (shallow FFNN) to 98% accuracy on the test sample …

34
TEST DIFFERENT TUNINGS

35
